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Q , Abstract 

o ; 

| We consider the problem of minimizing upper bounds and maximizing lower bounds on information rates of 

stationary and ergodic discrete-time channels with memory. The channels we consider can have a finite number 
of states, such as partial response channels, or they can have an infinite state-space, such as time-varying fading 
channels. 

00 ' 

^vq ' We optimize recently-proposed information rate bounds for such channels, which make use of auxiliary 

finite-state machine channels (FSMCs). Our main contribution in this paper is to provide iterative expectation- 
maximization (EM) type algorithms to optimize the parameters of the auxiliary FSMC to tighten these bounds. 
We provide an explicit, iterative algorithm that improves the upper bound at each iteration. We also provide an 
effective method for iteratively optimizing the lower bound. To demonstrate the effectiveness of our algorithms, 
we provide several examples of partial response and fading channels, where the proposed optimization techniques 
significantly tighten the initial upper and lower bounds. Finally, we compare our results with an improved variation 
of the simplex local optimization algorithm, called Soblex. This comparison shows that our proposed algorithms 
are superior to the Soblex method, both in terms of robustness in finding the tightest bounds and in computational 

^J. , efficiency. 

Interestingly, from a channel coding/decoding perspective, optimizing the lower bound is related to increasing 
the achievable mismatched information rate, i.e., the information rate of a communication system where the 
^P. ' decoder at the receiver is matched to the auxiliary channel, and not to the original channel. 
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I. Introduction 

A. Motivation and Background 

Channels with memory are common in practical communication systems. The partial response channel is an 
important example of a channel with memory with applications in magnetic and optical recording, as well as in 
communications over band-limited channels with inter-symbol interference (ISI) [1]. The time-varying multipath 
fading channel in wireless communication systems is another example of a channel with memory [2]. Although 
the information rate of such channels is formulated (we assume them to be stationary and ergodic), the direct 
computation of the information rate (under the assumption of no channel state information (CSI) at the receiver 
or the transmitter) has remained an open problem [3]. 

Partial response channels can be closely modeled by finite-state machine channels (FSMCs) [4]. Stochastic 
and numerical strategies have been proposed in the literature for efficient computation of information rates 
for FSMCs [3], [5]-[7]. The numerical estimate of the information rate converges under mild conditions with 
probability one to the true value when the length of the channel input and output sequences goes to infinity. Upper 
and lower bounds on the capacity of FSMCs have been proposed in [8]— [10], where the upper bound in [8] is 
based on Lagrange duality, the upper bound in [9] is based on the FSMC feedback capacity, and the lower bound 
in [10] is obtained by numerically optimizing the parameters of the Markov input source to the channel. From 
a practical viewpoint, information rates, the capacity, and the capacity-achieving input distribution of FSMCs 
with not too many states is numerically computable. However, for more complex partial response channels with 
longer memories, the large number of states in the FSMC prohibits efficient computation of information rates. 
Physical multipath fading channels are channels with an infinite (continuous-valued) state space. Therefore, direct 
application of the techniques in [3], [5]-[7] for computing information rates for fading channels is not possible. 

For the case of stationary and ergodic channels with memory that are non-fmite-state or where the number of 
states is large, we would still like to efficiently (stochastically) compute upper and lower bounds on the information 
rate. Such upper and lower bounds were proposed in [3], [11] based on the introduction of an auxiliary FSMC. 
The bounds are generally applicable to finite-state and non-fmite-state channels with memory that are stationary 
and ergodic. The lower bound in [3], [11] is a special case of the generalized mutual information (GMI) lower 
bound for mismatched decoding [12]. In other words, the lower bound signifies achievable information rates 
when the receiver is equipped with a decoding algorithm which is matched to the auxiliary channel model and 
hence, usually mismatched to the original channel (over which the actual communication takes place). 

The maximum number of states and branches in the auxiliary FSMC model is dictated by the computational 
complexity of running the Bahl-Cocke-Jelinek-Raviv (BCJR) algorithm [13] or the mismatched decoding budget. 
The number of auxiliary FSMC states and branches, along with the FSMC state transition and output probabilities, 
affect the tightness of the bounds. Therefore, for a given auxiliary FSMC trellis section, it is desirable to optimize 
the remaining auxiliary FSMC parameters to obtain the tightest information rate upper and lower bounds, which 
is the topic of this paper. 



B. Contributions and Organization 

In this paper, we optimize the parameters of the auxiliary FSMC model in order to tighten the information 
rate upper and lower bounds that were introduced in [3], [11] for channels with memory. Actually, the lower 
bounds that we use in the present paper are slightly more general than the lower bounds in [3], [11]. They are 
also more general than the lower bounds that were optimized in the preliminary version [14] of this paper. 

The original and auxiliary channels are reviewed in Section [TT] and the information rate bounds under consid- 
eration are reviewed in Section [Till In our approach, we assume that we are given a fixed trellis section of an 
auxiliary FSMC and that we optimize its remaining parameters (the state transition and output probabilities). The 
general optimization idea is shown in Section [TV] Briefly speaking, we replace the optimization of a function by a 
succession of surrogate-function optimizations, an approach that can be seen as as a variation of the expectation- 
maximization (EM) algorithm [15]. 

Five main contributions of the paper are summarized as follows. 

1) In Section [Vl we propose an iterative procedure for the minimization of the upper bound. For this purpose, 
we devise an easily optimizable surrogate function. We establish that this surrogate function is never below 
the original upper bound and that by minimizing it, we ensure non-increasing upper bounds in each iteration. 

2) In Section |VlJ we propose a similar iterative procedure for the minimization of the difference between the 
upper bound and a specialized version of the lower bound. The parameters of the auxiliary FSMC that 
minimize this difference can be used as the initial point for the optimization of upper and lower bounds, 
resulting in quicker convergence or tighter bounds. 

3) In Section I VII I we propose an iterative procedure for the maximization of the lower bound by devising an 
easily optimizable surrogate function. The important property of this surrogate function is that at the current 
point in the auxiliary FSMC parameter space, the function value and its gradient agree with the lower bound 
function value and its gradient, respectively. However, we were not able to establish analytically that this 
surrogate function is never above the lower bound, which makes the approach rather heuristic. Note though 
that our surrogate function of choice will have a parameter which enables one to control the "aggressiveness" 
of the optimization step. Adaptively setting this parameter allows one to have a non-decreasing lower bound 
after every step. 

4) In Section IVIII1 we apply our optimization techniques to several partial response channels and observe 
that they result in noticeably tighter upper and lower bounds. We analyze the convergence properties and 
the numerical tolerance of the proposed algorithms. Moreover, we compare our optimization results with 
those obtained using an improved version of the simplex local optimization algorithm [16]. The improved 
simplex method is called Soblex and was recently proposed in [17]. This comparison shows the superiority 
of our algorithms in terms of computational efficiency, tightness, and reliability of the optimized bounds. 

5) In Section Hx] we apply our optimization techniques to time-varying fading channels. We also propose 
a tight lower bound for the conditional entropy of the original Gauss-Markov fading channel, which is 
required for computing the upper bound. Compared to the widely used perfect CSI upper bound, we have 
obtained significantly tighter upper bounds, which together with the optimized lower bound, successfully 
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bound the range of fading channel information rates. 
Some of the proofs have been relegated to the appendices at the end of the paper. 

C. Notations 

The following general notations will be used. Other special notations will be introduced later in the paper. 

• Alphabet sets will be denoted by calligraphic characters, such as A. 

• Random variables will be denoted by upper-case characters {e.g. X), while their realizations will be denoted 
by lower-case characters {e.g. x). 

• Random vectors will be denoted by upper-case boldface characters {e.g. X), while their realizations will be 
denoted by lower-case boldface characters {e.g. x). 

• Sequences like . . . ,x-i,xo,x%, . . . will be denoted by {xf\. If j > i, we will use the short-hand to 
denote the vector [xi,Xi + ±, . . . , Xj-i,Xj]. 

• All logarithms are natural logarithms (base e); therefore, all entropies and mutual informations will be 
measured in nats. The only exceptions are figures and their corresponding discussions, where the information 
rates will be presented in bits per channel use (bits / channel use). 

• All channel input and output alphabets are assumed to be finite. Unless stated otherwise, we only deal with 
probability mass functions (pmfs) and conditional probability mass functions in this paper. 

• In order to avoid cluttering the summation signs, we will use the following conventions. Summations like 
E*. Ey E s > and Efe win implicitly mean J2xex> Eyey E se «s> and EbeB* respectively. Summations 
like ]T X and Yl y will implicitly mean Ylxex N and Eyey™ ■ Summations like V\ and ^ b will be 
over all valid channel state sequences and valid channel branch sequences, respectively^ 

II. Source and Channel Models 
Before presenting source and channel definitions, it is useful to define the following index set. 

Definition 1 (The Index Set) We assume N to be a positive integer. We define the following index set 

3-N — [-N + 1, N] = {—N + 1, . . . , iV}. 

Observe that the size of the set is \Xn \ = 2N. Note that in our results, we are mainly interested in the limit 

N -> oo □ 

Consider the block diagram in Fig. Q}a) which shows a source and a channel, in the following also called the 
original forward channel. We assume that both the source and the channel are fixed and that we would like to find 
tight upper and lower bounds on the information rate of this source/channel pair. Our upper bounding technique 
will use a so-called auxiliary forward channel (see Fig. (He)) whose input/output alphabets match the input/output 
alphabets of the original forward channel. The tightening of the upper bound will be achieved by optimizing 
the parameters of the auxiliary forward channel. Similarly, our lower bounding technique will use a so-called 

'The notion of "valid channel state sequences" and "valid channel branch sequences" will be introduced later on. 
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Fig. 1. Block diagrams of (a) the source and the original (forward) channel, (b) the "output source" and the original backward channel, 
(c) the auxiliary forward channel, and (d) the auxiliary backward channel under consideration. 

auxiliary backward channel (see Fig. (Hd)) whose input/output alphabets match the input/output alphabets of the 
original backward channel (see Fig. [Ub)) that is associated to the original forward channel. The tightening of 
the lower bound will be achieved by optimizing the parameters of the auxiliary backward channel. 

A. Source Model 

Definition 2 (Source) In this paper we only consider sources that are discrete-time, stationary, and ergodic and 
that produce a sequence . . . , X—\,Xq,X\, . . ., where Xi G X for all I G Z. We assume that the alphabet X 
is finite and that for any i < j the probability of observing is P x j(x|). In the following, we will use the 
abbreviation Q(xj) = P x j (x^) for any i < j, □ 

B. Finite-State Machine Channels (FSMCs) 

Many of the original channels that we will consider can be described as FSMCs@ 

Definition 3 (Finite-State Machine Channels (FSMCs)) A time-independent, discrete-time finite-state machine 
channel [4] has an input process . . . , Xq, X\, . . ., an output process . . . , Y_i, Yq, Yj_, . . ., and a state process 
. . . , S-i, So, Si, . . ., where Xi S X, G y, and Si G S for all £ G Z. We assume that the alphabets X, y, and 
S are finite. Let us define the following finite windows of the FSMC states, inputs, and outputs as: 

c A „N a_ TV A „N m 

s — s -iv+i> x — x -at+ii y — y-N+1- y L ) 

2 Note that Gallager [4] calls them finite-state channels (FSCs). 
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Fig. 2. Trellis representation of the finite-state machine behind an FSMC. Because of the assumed time-invariance, the branch labels in 
every trellis section are the same. Therefore, we have shown these branch labels only in one trellis section. 
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Fig. 3. Trellis section of the trellis representation of the finite-state machine behind a Gilbert-Elliott channel. State "1" corresponds to 
the bad state "b" and state "2" corresponds to the good state "g". 

Unless otherwise stated, we condition all FSMC-related probabilities on an initial state, such as s_jy. The joint 
FSMC conditional pmf decomposes as 

^s,Y|x,5_„(s ) y|x,s_ Ar ) = Yl Ps e ,Y e \s e - u x e {s^ ye\se-i,xe) (2) 

= II p s t \s t - u x t {at\H-i,xt) II p Y e \s e - 1: x e ,sM\ s e-^ x ^ s e) » ( 3 ) 
yei N J \eei N J 

where Ps e \s e -i,x e an d PYAS t -i,x t ,s t are referred to as the FSMC state transition probability and FSMC output 
probability, respectively, and are independent of the time index £. In the following, we will use the notation 

W(8i\8i-i,xi) = Ps e \s e - u x e {se\se-i,x e ), (4) 

W(y e \se-i,xe,s e ) = PY e \s e - u x e ,s t {yi\ s i-^ x ^ s i)^ ( 5 ) 

W(s, y |x, s_iv) = P s ,Y|x,5_ N ( s > y l x i s -n), (6) 

W(y\x,s^ N ) = P Y \x,S- N (y\*,s-N) = W(s,y\x,s- N ). (7) 

s 

□ 
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Fig.[2] shows how the finite-state machine behind a typical FSMC can be represented by a trellis. This specific 
FSMC has 4 states, namely S = {1, 2, 3, 4}, and two input symbols, namely X = {— 1, +1}- To a branch going 
from state i to state j we associate a label like (a^-j/jy): it shows the probability pij with which this state 
transition is chosen in case that the channel input symbol is Xij. 

Example 4 (Partial Response Channel) The trellis in Fig.[2] actually represents the trellis of the finite-state 
machine that is behind a so-called partial response channel. Such a channel can be described by 

^ h m xe-m + n e \ . (8) 

m=0 / 

Here, \i is a quantization function that maps elements of E to y. Moreover, {x{\, {yi}, {up}, {h m } represent 
the channel input process, the channel output process, an additive noise process, and the filter coefficients, 
respectively. Assuming the channel input alphabet to be X = {— 1, +1} and the memory length to be M = 2, 
we see that in Fig.|2] the state = 1 corresponds to (X£_i,JQ) = (—1,-1), the state = 2 corresponds 
to (Xe-i,Xe) = (—1,-1-1), the state Si = 3 corresponds to [X(.-\,Xi) = (+1,-1), and the state Se = 4 
corresponds to (JQ_i, JQ) = (+1,+1). □ 

Example 5 (Gilbert-Elliott Channel) The Gilbert-Elliott channel [18], [19] has the state alphabet S = {g,b}, 
i.e., a "good" state and a "bad" state, the input alphabet X = {0, 1}, and the output alphabet {0, 1}. One defines 

W{si,y(\si^Xi) = W{s t \si-. 1 ) ■ W(yi\si-i,xi) where W(b\g) = p h and W(g\b) = p g , where W{y e \s e -i,x e ) 
is a binary symmetric channel (BSC) with cross-over probability e g when = g, and where W(yi\s^i, xi) 
is a BSC with cross-over probability eb when si-\ = b. A trellis section of the trellis representation of the 
finite-state machine behind such a channel is shown in Fig.[3] (Here, p g , pb, e g , and eb are arbitrary real numbers 
between and 1, where usually je g — 1/2 [ > — 1/2|.) □ 

It is useful to introduce the FSMC branch random variable Bi = (Se-i,Xe, Si) and the set B of all branches 
in a trellis section. (Note that because the original FSMC is assumed to be time-invariant, the set B is also 
time-invariant.) The initial state of a branch € B at time index I will then be denoted by S£_i(bg), the channel 
input symbol by x(be), and the final state by se(be). It can easily be seen that without loss of generality, we can 
assume that for any triple (s£-i,X£,3£) there is at most one branch in the trellis that starts in state 8£-\, has 
input symbol X£, and ends in state S£. In that sense, in the following we will use the notation 

W(ye\bt) = P Yt \ Bll {yi\h) = Py^Si-x^StiyiW-ux^si) = W{y f \s f ^i,x h S£) (9) 

for b( = (s^-i, X£, S£). Similarly to s, x, and y, we also define b = b N N+l which helps us define 

W(b, y|x, s- N ) 4 W(s, y|x, s^ N ). (10) 

In this and upcoming similar expressions, x and s are implicitly given by b, i.e., x = x(b) and s = s(b). The 
sequence b is called a valid branch sequence if the ending state of bg_\ equals the starting state of b£. Similarly, 
the sequence s is called a valid state sequence if there is a valid branch sequence b such that s = s(b). As 



already mentioned in the introduction, summations like J2 S and ^ b will always be over all valid channel state 
sequences and all valid channel branch sequences, respectively. 

In this paper, we consider only FSMCs that are indecomposable, as defined by Gallager [4], i.e. channels 
where, roughly speaking, the influence of the initial state fades out with time for every possible channel input 
sequence. Feeding such a channel with a stationary and ergodic source results in input and output processes that 
are jointly stationary and ergodic. Therefore, when feeding an indecomposable FSMC with a source as in Def. [2] 
we will use the notation 

(QW)(y\s- N ) 4 ^Q(x|xl^) W(b,y\ X ,s. N ). (11) 

b 

Moreover, in order to be able to apply the results of [20] later on, we will impose the following condition on 
W: for all bi € B and all y^ 6 y, we require that W(yi\bi) is strictly positive, i.e., we require that any output 
symbol yi G y can potentially be observed for any branch bt G B. 

Definition 6 (Data-Controllable FSMCs) If an indecomposable finite-state machine channel can be taken from 
any state into any other state by a finite number of channel inputs, which do not depend on the current state, the 
channel is called controllable. (Referring to [4, p. Ill and p. 527], we note that there are also decomposable 
channels that could be called controllable and for which the unconstrained capacity is uniquely defined. However, 
in the following we will not consider such channels because we deal exclusively with indecomposable channels.) 
□ 

Clearly, the partial response channel in Ex. 0] is data-controllable, whereas the Gilbert-Elliott channel (cf. [5]) 
is not data-controllable. 



C. General Channels with Memory 

We will allow the original channel to be a more general stationary discrete-time channel than an indecomposable 
FSMC, namely, we allow the state-space size to be infinite, as long as the following requirement is satisfied: it 
should be possible to approximate such a general channel to any desired degree by an indecomposable FSMC. 

D. Original Backward Channel Model 

Reversing the usual meaning of X and Y, i.e., looking at X as being the output of some channel which is 
fed by Y which in turn is produced by some source, we arrive at the "backward" channel model in Fig. CUb). 
Note that the mutual information /(X; Y) is a functional of the joint pmf of X and Y. Therefore, if X and Y 
have the same joint pmf in both the forward and the backward channel setup then both channels will have the 
same information rate. This can be achieved by setting the pmf of the "output" source to be R(y) = (QW)(y) 
and the "backward" channel law to be V(x|y) = Q foR y l x) because then Q(x)W(y|x) = R(y)V(x\y). 



E. Auxiliary Forward Finite-State Machine Channels 

The information rate upper bound in Section [Hi] will be based on an auxiliary forward channel law. By an 
auxiliary forward channel law we will simply mean a conditional pmf on y given x. In general, we will denote 
this conditional pmf by W"(y|x) and it fulfills the usual properties of a conditional pmf, i.e., W(y\x) > for 
all x and all y, and W"(y|x) = 1 for all x. 

In the following, we will focus on the case where the auxiliary forward channel is an auxiliary forward 
finite-state machine channel (AF-FSMC). 

Definition 7 (AF-FSMCs) A time-independent and discrete-time AF-FSMC has an input process . . . , X—i, Xq , X\ , 
an output process . . . , Y—i, Yo, Y\, . . ., and a state process . . ., S—i, So, Si, . . ., where Xi £ X, Yg £ y, and 
Si € S for all i G Z. We assume the set S to be finite. Let us define the following finite windows of the AF-FSMC 
states and branches as: 

§ = §Vi. b = b-;v + i- (12) 
The AF-FSMC conditional pmf decomposes as 

iy(s,y|x,s_Ar) = Y[ W(s£,y e \s e -i,xi) = H W(st\st-i,x e ) (II W(ye\se-i,xi,s e ) > ( 13 ) 

where W{s(\s^x,X() and W(ye\si-i,X{, §e) are referred to as the AF-FSMC state transition probability and 
AF-FSMC output probability, respectively, and are independent of the time index t. The input-output conditional 
pmf will then be 

W(y|x, s_jv) = ^( § > yf x ' S -^)- (14) 

s 

□ 

It is useful to introduce the AF-FSMC branch (random) variable = (Sg-i,Xi, St) and the set B of all 
branches in a trellis section. (Note that because the AF-FSMC is assumed to be time-invariant, the set B is also 
time-invariant.) The initial state of a branch bp at time index I will then be denoted by si-\{pe), the channel 
input symbol by x(be), and the final state by se(be). In that sense, we will often write W(yi\be) instead of 
W(yt\st-x, X£, st), and W(h, y|x, S-n) instead of W(s, y|x, s_jv) if b = (x, s). As in Section Hl-BI without 
loss of generality we can assume that for any triple (s~t-i, xt, Sf) there is at most one branch in the trellis that 
starts in state §t-i, has input symbol xt, and ends in state Sf 

Similar to the original channel, we consider only AF-FSMCs that are indecomposable and for which W(yi\b£) 
is strictly positive for all be € B and all y£ € y. 



Remark 8 (Induced pmfs) Of special interest is the case where x is generated according to the standard 
source: the induced joint pmf of x, s, and y is then called P(x, s, y|s_Ar) and equals P(x, s, y|s_Ar) = 



Q(x)W(s, y|x, S-n). In that sense, 

P(b,y\s_ N ) ± P(x,s,y\s_ N ), (15) 
P(x,y[s_Ar) = ^P(x,s,y|s_Ar), (16) 

s 

P(y|s_jv) = 2P(x,s,y|5_ JV ) = £ P(x, y|s_jv) = (QWOfrlS-jv), (17) 

X,S X 

P(b|y, s_at) = — , (18) 

P{y\s-N) 

P(b|x, y, s- N ) = — : -, (19) 

E P(b',y|s_7v) 

b' : b' is compatible with x 

where in the last expression we assume that x is compatible with b. □ 

F. Auxiliary Backward Finite-State Machine Channels 

The information rate lower bound in Section HIT] will be based on an auxiliary backward channel law. By an 
auxiliary backward channel law we will simply mean a conditional pmf on x given y. In general, we will denote 
this conditional pmf by V"(x|y) and it fulfills the usual properties of a conditional pmf, i.e., V"(x|y) > for all 

x and all y, and E x ^( x ly) = 1 f° r au Y- 

In the following, we will focus on the case where the auxiliary backward channel is an auxiliary backward 
finite-state machine channel (AB-FSMC). 

Definition 9 (AB-FSMCs) A time-independent and discrete-time AB-FSMC has an input process . . . , Y—\, Yq, Yy, . 
an output process . . . , X^\, Xq, X\, . . ., and a state process . . ., S—\, Sq, S\, . . ., where Y# G y, Xi G X, and 
Si G S for all t G Z. We assume the set S to be finite. Let us define the following finite windows of the AB-FSMC 
states and branches as: 

§ = §Vi, 6 = #Vi- (20) 

The AB-FSMC conditional pmf is defined to be 

T >/~ I - \ a w(s_jv,s,x,y) 

V(s, x y, S- N ) = — - — - (21) 

v(s- N , s',x',y) 

s',x' 

with 

v(s- N ,s,x,y) = Q(x) • JJ v (%-i,X|,S|,y^ 2 V (22) 

Here, D\ and D2 are some arbitrary non-negative integer^ and v(st-\, X£, S£, y^^ 2 ) is a non-negative function 
of [(s£^i,X(, §(), y^^ 2 ) G B x y D i+ D i+ 1 . Note that the latter requirement on X(, S(, y^ 2 ) is sufficient 

3 We will comment on the selection of suitable values for Di and D2 at the end of Section [ill] and in Section ITV-CI In the expressions 
V(s, x|y, S-n) and u(s_jv, §, x, y), it would actually be more precise to write y_%^_\_£ ) instead of y (= y^jv+i) m me case tna t 
Di > and/or D2 > 0. However, in order to keep the notation simple and because we are mostly interested in the limit N — * 00, we 
will not do so. 



for V(s, x|y, s_jv) to represent a conditional pmf of s ancf x g/ven y and 77ie backward input-output 

conditional pmf will then be 

V(x\y,8- N ) ±Y,V(s,x\y,s- N ). (23) 

s 

□ 

In the following, we will also use 

-t)(s_Ar,x,y) = ^z)(s_Ar,s,x,y), (24) 

s 

T >/*i - \ a V'Cs, x|y, s_jv) ^(s-JV,s,x,y) t)(s_Ar,s,x,y) 

V(s x, y, s__at) = . — ■ — ; = „ r = —77 t-. (25) 

Eg'^(s> x ly>«-jv) Ls' u ( s -.tf> s > x >y) v{s^ N ,x,y) 

It is useful to introduce the AB-FSMC branch (random) variable Bg = (S£_i , JQ, Sf) and the set B of all 
branches in a trellis section. (Note that because the AB-FSMC is assumed to be time-invariant, the set B is also 
time-invariant.) The initial state of a branch b\ at time index £ will then be denoted by the (backward 

channel) output symbol by x(bi), and the final state by sg(S^). In that sense, we will often write v(be,y^^, 2 ) 
instead of v(si-. 1 ,xt,st,yj^' i ), V(h\y,s^ N ) instead of V(s, x|y, s_at) if b = (x, s), and V(b\x, y, §- N ) 
instead of V"(s[x, y, s_jv) if b = (x, s). As in Section Hl-BI without loss of generality we can assume that for 
any triple (§t-i,X£, §£) there is at most one branch in the trellis that starts in state has (backward channel) 
output symbol X£, and ends in state sg. 

Similar to the original channel, we consider only AB-FSMCs that are indecomposable and for which v(b(, y^ol) 
is strictly positive for all b e G B and all y\t%\ e y D ^+ D ^+ l . 

G. Remarks 

Let us briefly point out some notational differences with the notation used in the paper on the generalized 
Blahut-Arimoto algorithm [10]. Similar to that paper, we are using x to denote the source output / channel input 
sequence and y to denote the channel output sequence. However, in the current paper s and b denote the state 
and branch sequence, respectively, in the trellis representation of the channel, whereas in [10], s and b were used 
to denote the state and branch sequence, respectively, in the trellis representation of the source. Finally, note that 
in both papers the time indexing of the components of the input, output, state, and branch sequences are done 
in the same manner, however, s is defined to be s^ N+1 and not s^. 

We have already mentioned that we will be mainly interested in the limit N — > oo. Because our setup is such 
that the limits of the expressions of interest do not depend on the past x~^ and the initial states S-n and §-n 
(this can be justified using results like in [20]), in the following we will assume that xl^, s_jv, and s^n are 
fixed to some suitable and mutually compatible values. In order to simplify the notation in all the upcoming 
formulas, we will omit the explicit conditioning on xZ^, S-at, and s_n, however, it should be kept in mind 
that such a conditioning is still present. 

4 The letter v was chosen to show the closeness of the v function to the V function, yet to remind the reader that the v function is not 
properly normalized to be a conditional pmf of s and x given y and s_jv. 
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In the following, a generic branch of an AF-FSMC / AB-FSMC will often be denoted by b = (s p , x, s) instead 
of b\ = (§g-x,X£, sg). (Here, "p" stands for previous.) Similarly, the generic output symbol corresponding to this 
branch will often be denoted by y instead of ye. Moreover, the generic y— will be used instead of y^Jp . In this 
manner, v(be,y e e ^ 2 ) in an AB-FSMC will simply be denoted by v(b, y— ). Such simplifications in notation are 
possible because of our stationarity assumptions for sources and channels. 

III. Information Rates and Their Upper and Lower Bounds 

In the following, the source pmf and the original (forward) channel law will be assumed to be fixed, and we 
will only vary the auxiliary forward and auxiliary backward channels. Therefore, in order to simplify the notation, 
Q and W will not appear as arguments of information rates, upper and lower bound formulas, etc. (Note that a 
fixed source pmf and a fixed original forward channel law imply, according to our comments in Section III-DI 
that the "output" source pmf and the original backward channel law are also fixed.) 

Definition 10 (Information Rate) For the type of sources and channels that were considered in Section [77] the 
information rate is given by 

j(N) a J_ /(x Y)4iW Q(x)W(y|x) log ( W{Y l*\ ) , (26) 

with asymptotic version I = lim/vr^oo O 

Note that for any finite N, the information rate I^ N > depends on the choice of xl^ and s_at, however, we will 
not mention these quantities explicitly as arguments of l( N >. Similarly, we will not mention them in upcoming 
functionals, along with not mentioning the initial state s_tv of an auxiliary channel. 

Definition 11 (Information Rate Upper Bound Based on an AF-FSMC) Using an AF-FSMC as defined in 
Def. an upper bound on the information rate is given by 

with asymptotic version I{W) = lini7v->oo I N (W). This upper bound was also used in [3], [11] and we refer 
the interested reader to these papers for some historical comments. □ 

The fact that 1 {N) {W) is indeed an upper bound on I^ N > can easily be verified by writing the difference 
I N (W) — as a Kullback-Leibler (KL) divergence [21, p. 18] and by using the well-known fact that the 
KL divergence is never negative, i.e. 

f N \w) - 1^ = D y ((QW)(y) \\(QW)(y)) > 0. (28) 

It is noted that for computing the upper bound I N (W), an analytical or numerical evaluation method for the 
conditional entropy 

H(N) - -^E^ x )^l x ) lo §(^(yl x )) (29) 



of the output process given the input process in the original channel is required. (The asymptotic version of 
will be called H.) Alternatively, a lower bound on H^ N > can be used to obtain an upper bound on I {N) {W). 
Later in Section JX] we prove a tight lower bound on H^ N > for Gauss-Markov fading channels. 

Let us briefly mention that the expression in d2Tb is still a valid information rate upper bound also if (QW)(y) 
is replaced by some arbitrary pmf over y. However, we will not pursue this more general information rate upper 
bound any further in this paper. 

Definition 12 (Information Rate Lower Bound Based on an AB-FSMC) Using an AB-FSMC as defined in 
Def. [9] a lower bound on the information rate is given by 



(30) 



with asymptotic version I_(v) = limjv— >oo 1^ (v), and where V is implicitly defined by v as shown in (1211 )- 
□ 

Again, that 1^ (v) is a lower bound on J W can easily be verified by writing the difference l( N ) — J} N > (v) 
as a KL divergence, i.e., 

/(AO _ j(at)(£) = Dx y (Q( x )w(y\x) || (W(y)V(x|y)) > 0. (31) 
On the side, let us remark that l( N ) — (v) can also be written as 

-£<""(«) = BTO A ( q ™ ) X) I ^W) ■ (32) 

The lower bound defined in (l30l is linked with the generalized mutual information (GMI) which is defined 
as [22] 

I ^ e -a<«'d( N >(x,y) 

a<~)>o— £ x ,Q(x')e-^ w (x',y)' 

where dS N > (x, y ) is a mapping from X 2N x y 2N to R. Under suitable time -invariance conditions on dS N \ 
this GMI has the following interesting meaning: it is a reliable communication rate that is achievable under 
mismatched decoding, i.e., when the decoder uses the decoding metric S N \x,y) instead of the ML decoding 
metric — log W(y|x). (Maximizing the GMI over all possible sources gives a lower bound on what is known 
as the mismatch capacity under the decoding metric d^(x,y), N — > oo.) Now, setting = 2N and 



J gmi-^7 SU P JZ Q(x)W(y|x) log ^ ^__ A _ n(N)MN) ^, ^ , (33) 



ix, y; 2N g { Q(x) 



(34) 



where /(y) is an arbitrary positive function over y, we see that the mutual information rate lower bound in 
Def. [12] is a special case of the GMI. This means that I_( N \v) is a reliable communication rate that is achievable 
under mismatched decoding with the decoding metric d^ N \x,y) as in d34l ), i.e., a decoder that is matched to 
the AB-FSMC and mismatched to the original channel^ 

5 Note that if there is a function /(y) such that Q(x) = /(y)^( x |y) f° r all x > then /(y)^( x |y)/Q( x ) can t> e seen a conditional 
pmf of y given x. In this special case, the mismatched decoder is the ML decoder that is matched to the auxiliary channel. 



Remark 13 (First Special Case of Information Rate Lower Bound) A special case of lj N \v) is obtained by 
setting y(x|y) = Q(-x)W (y|x) / (QW)(y) where W is some arbitrary AF-FSMC law. The lower bound then 
reads 



_Q(jt)W(y|x) 



V(xlv) 

v (QW)(y) x ,y 



-4VQ(x)^(y|x)l gf #(y|x) V (35) 



This lower bound was also used in [3], [11] and we refer the interested reader to these papers for some historical 
comments. Please note that this specialized lower bound is the lower bound that was optimized in the preliminary 
version of this paper [14]; this is in contrast to the present paper where we will optimize the more general lower 
bound that was given in Def. [T2j □ 

Ideally, we would like to define the difference of the information rate upper bound for any AF-FSMC law W 
and the information rate lower bound for any AB-FSMC law V. However, in order to obtain something tractable 
for optimization purposes, it turns out to be expedient to use a V that is implicitly defined through W as in (|35T ). 



Definition 14 (Difference Function) Let W"(y|x) be the law of some AF-FSMC. The difference function is 
defined to be 



AW(1V)4/(W)-/W(V) 



(Qw)(y> x,y 



— YQ(x)W(y\x)lo S (V^-) (36) 



;Ac, y Q(x)iy(y|x) Q(x)iy(y|x) ) , (37) 



1 

~ 2N' 
with asymptotic version 



Minimizing this difference function would not be significant if we could directly find the global minimum of the 
upper bound and the global maximum of the lower bound. Moreover, a minimized difference between the bounds 
does not necessarily mean that the individual upper and lower bounds are optimized. Nevertheless, we will see 
that the minimization of the difference function can give a useful initialization point for the iterative optimization 
of the upper and lower bounds. Such an initialization can result in faster-converging iterative algorithms or 
tighter upper and/or lower bounds. This is especially the case for partial response FIR channels, where the AF- 
FSMC parameters that minimize the difference function are found in closed form. For fading channels, using 
the parameters of the optimized difference function for initializing the maximization of the lower bound often 
yields better lower bounds compared to using other initialization methods. However, using the parameters of the 
optimized difference function for initializing the minimization of the upper bound bound usually does not yield 
better bounds compared to using other initialization methods. We refer the reader to Section IVIIII and Section HXl 
for more details. 

Let us conclude this section with yet another special case of the information rate lower bound. 

Remark 15 (Second Special Case of Information Rate Lower Bound) In Def. [9] the only requirement on 
v(s£_i,X£, §£, y^Di) was that it is non-negative. Imposing additionally the condition that Y^g y^^) 
1 for all §£-i, all x£, and all yft^, one can verify that X^sx^(^' x >y) = 1 f° r an y anc ^ a ^ $~N, see 



App. U Therefore, the denominator in (l2TT > is 1 which means that V(s,x\y) = v(s,x,y). Finally, this implies 
V"(x|y) = u(x, y), and so the lower bound ( [30b reads 

i (JV) («) ^ ^EQW^yW^ (^) • ( 38 > 

□ 

Let us briefly comment on the number of parameters of AB-FSMCs. From Def. [9]it follows that the number of 
parameters is \B\ x \y\ Dl+E>2+1 . It is clear that larger D\ and D2 lead to better lower bounds but also to the need 
of estimating and storing more parameters. Given the exponential growth in D\ + D2, it is obviously desirable 
to choose D\ and D2 as large as needed, yet as small as possible. Empirical evidence shows that D\ = and 
D2 = (the smallest possible choice) is sufficient for the Def. [12] lower bound to give good results. However, 
non-zero values of D\ and D2 are usually needed for the Rem. [15] lower bound to work well. (Given typical sizes 
of |3^|, a positive sum of D\ and D2 yields a significant increase in AB-FSMC model parameters.) Although 
we will see in Section IVIII that it is more difficult to get a handle on the optimization of the Def. [12] lower 
bound, which is in contrast to the nice analytical properties of the optimization of the Rem. [15] lower bound, the 
Def. [12] lower bound will be the lower bound of choice. Indeed, all examples in Sees. IVIIII and [IX] will use it 
with Di = and D 2 = 0. 

IV. Optimization Methods and Definitions 

The first main objective of the present paper is the minimization of the information rate upper bound that 
was presented in Section [ill] This optimization takes place over the set of all AF-FSMCs that have the same 
trellis section B, i.e., for a given B we will optimize over all possible {^(sjsp, x)f The second 

main objective of the present paper is the maximization of the information rate lower bound that was presented 
in Section [ill] This time, the optimization takes place over the set of all AB-FSMCs that have the same trellis 
section B, i.e., for a given B we will optimize over all possible {v(b, y— 

Note that the direct optimization of the information rate upper and lower bounds is in general intractable. This 
is mainly due to the fact that (QW)(y), W"(y|x), and V"(x|y) in the logarithms are not readily decomposable 
into suitable products. This, in turn, prevents their optimization in a mathematically tractable manner. 

As a way out of this problem, we will use iterative approaches which after every iteration yield better bounds. 
As we will see, each of these iterations can be seen as the optimization of a suitably chosen surrogate function. 
The rest of this section is devoted to the presentation of the general ideas; the mathematical details and the proofs 
will be treated in later sections. 

Although it very often makes sense that the type of the AF-FSMC is chosen such that it matches the type of 
the original channel, e.g., that a controllable AF-FSMC is chosen if the original channel is a controllable FSMC, 

6 ln the following, we assume that the trellis section B is such that in the relative interior of the set of all possible settings of 
{W / (s|s p , x)} U the conditions in Section Hi-El that were imposed on AF-FSMCs are fulfilled. 

7 ln the following, we assume that the trellis section B is such that in the relative interior of the set of all possible settings of {v(b, y— )}, 
the conditions in Section Hl-FI that were imposed on AB-FSMCs are fulfilled. 




w 



w w* 

Fig. 4. Iterative minimization of the upper bound using a surrogate function. 

please note that such a restriction in choice is not required for the optimization algorithms that we are about to 
present. 

A. Optimization Approach for the Information Rate Upper Bound 

The underlying idea for optimizing the upper bound is as follows (see Fig. @]): 

• Set the iteration number to t = 0. Start with an initial AF-FSMC model with channel law W^°K 

• Assume that at the current iteration we have found an AF-FSMC model with channel law with the 
corresponding information rate upper bound I(W®). 

• At the next iteration we would like to find a "better" AF-FSMC model with channel law W^ t+1 \ which 
results in a tighter upper bound. To this end, we introduce a surrogate function ^f(W^ ,W) which lo- 
cally approximates I(W) around W = W®. More explicitly, the surrogate function is chosen such that 
^(W®, W®) = 7{W®) and such that it is never below the upper bound, i.e., ^(W^,W) > I(W) for 
all W. 

• Let us assume for the moment that we can find such a surrogate function ^>{W^\W) that can be easily 
minimized and let us call W^ t+l ^ the channel law that achieves the minimum of V(W®,W) over W (one 
such function is given in Section [V}. 

• Using this approach, we obtain a new channel law W^ t+1 ^ that does not increase the upper bound, i.e., 
I(Wi t + 1 )) < ^(^<*> ; ^<*+i>) <^(W®,W®) =I(W®). 

• We increase t by one and then repeat the above procedure until some termination criterion is met. 

It is important to note that unlike the idealistic situation depicted in Fig. |4l I(W) is in general not a unimodal 
function of W and therefore, I(W) can have multiple local minima. Later in Section El we will study the 



17 



convergence properties of the surrogate function for the upper bound. 

We will use the following nomenclature: a surrogate function that is never below a certain function will be 
called a never-below surrogate function. On the other hand, a surrogate function that is never above a certain 
function will be called a never-above surrogate function. 

B. Optimization Approach for the Difference Function 

The underlying idea for minimizing the difference function is very much similar to minimizing the information 
rate upper bound. Of course, we will need a different class of never-below surrogate functions: the surrogate 
function at the t-th iteration will be called &a(W® ,W). 

C. Optimization Approach for the Information Rate Lower Bound 

Plugging the definition (|2TI ) of V in terms of v into the information rate lower bound ( f3Qb . we obtain 



In the second special case of the information rate lower bound, cf. Rem. \T5\ the summation in the denominator 
of the logarithm in d39l) is always one and therefore only the numerator of the logarithm depends on v. Maximizing 
I_( N \v) is then very similar to minimizing the information rate upper bound and the difference function. This 
similarity comes from the fact that in both the negative upper bound and the negative difference function, only 
the numerator of the logarithm depends on the auxiliary FSMC parameterization. Correspondingly, we are using a 
never-above surrogate function for the present optimization. Our practical experience shows that non-zero values 
for D\ and D2 must be used in order to obtain tight lower bounds. Given that the size of the output alphabet is 
usually not too small, this means that the AB-FSMC has many parameters, which is somewhat undesirable. 

In the general setup of the lower bound, i.e., when we are not assuming the restriction of Rem. \15\ the 
choice D\ = and D2 = seems to be sufficient. In this case however, the numerator and the denominator 
of the logarithm in d39l ) depend on v and we have not been able to derive a surrogate function for which 
we can analytically guarantee that it is never above the lower bound. Therefore, instead of using a surrogate 
function $_(y^,v) for which we can prove that it is never above L {N \v), we will use a surrogate function 
$(6", v) for which we can prove the following two properties: firstly, \J/(uw, v) equals I(v) when v = v®, and 
secondly, the gradient of ^_(v^\v) w.r.t. v equals the gradient of I_{v) w.r.t. v when v = Such a surrogate 
function guarantees that we will be moving in a good direction, however, it does not guarantee that we obtain a 
non-decreasing lower bound after each iteration. Note though that our surrogate function of choice will have a 
parameter 7 which enables one to control the "aggressiveness" of the optimization step. Adaptively setting this 
7 parameter allows one to have a non-decreasing lower bound after every step. 

D. Some Key Quantities for Iterative Optimizations 

The following quantities will be repeatedly used for iterative optimizations in later sections. These quantities 
are evaluated at the currently found AF-FSMC channel law W = or the currently found AB-FSMC channel 
law v = u'*'. Therefore, we use the superscript ~ to denote these quantities. 




(39) 



Definition 16 (f^ N \b), f^ N \b,y), and T 2 (A ° (6, y£)) 

• In the case of the upper bound and difference function minimization, define 
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2N 



eel, 



(40) 



N be 



fW(b,y) 4 ^(QW)(y) — £ £P(S,|y) [S, = b\ [ye =y]\, (41 ) 

y \ £el« 

w/f/j asymptotic versions T\(h) = liio.jv-«x> T± N \b) and 2^(6,?/) = lim/v_»oo ^2 2/)- ^ ere > -P(My) 
defined in the spirit of Rem. [3] a«<i [6^ = b] is defined to equal one if bi = b and to equal zero otherwise, 
etc. Note that 2^(6) = J2 y T^ N \b, y) for all b. 
In the case of the lower bound maximization, define 



1 



y \ £eJ„ 

wY/j asymptotic version T2(b, y— ) = limjv_ +0 o T^ip, y— ). 



6, = 6 



(42) 



□ 



Algorithm 17 (Numerical/Stochastic Computation of f x (iV) (S), T 2 (7V) (6, y), and T 2 (7V) (S, y^)) Direct evaluation 
of the above quantities for all possible realizations of channel output y is prohibitive. However, there are efficient 
stochastic procedures to numerically approximate them to a precision that is good enough for practical purposes. 
Namely, we replace the ensemble average with the time average and proceed as follows: 

• Generate a sequence of channel output y of length 2iV in the original channel according to its pmf (QW)(y). 

• For the generated output sequence y in the original channel, compute P{b\\y) = xg, se\y, s_at) for 
all t £ Zjv by applying the BCJR algorithm [13] to the AF-FSMC model. (Note that b\ includes the channel 
input X£.) 

b, add the computed P(be\y) to f^ N \b). Similarly, whenever bi = b and yi = y, add the 



|yj tof 2 W(6,y). 



• Whenever bg 

computed P{v t \y } ^ ± 2 

(Of course, in the case of the lower bound optimization, P(bz\y), T^ib, y), and "AF-FSMC" have to be replaced 
by V{bi\y), T 2 (S, y— ), and "AB-FSMC", respectively, in the above sentences.) Because of the assumptions that 
we made in Section Hfl the resulting estimates are equal to T[ N \b), T 2 (6, y), and T 2 (S, y— ) with probability 

1 as N —>■ oo. 

□ 



Definition 18 (T 3 (7V) (6), T^>(b,y), and T 4 l 



In the case of the upper bound and difference function minimization, define 
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rf )(6) 4 ^Q( x )w(y|x) [ -i- £ E^l x >y 



1 



2iV 




(43) 



,y) 4 ^Q(x)W(y|x) — ]T ^P(6 £ |x,y) |S* = S| [a* = s(6) [ W = y] | , (44) 

with asymptotic versions 2^(6) = limjv->oo ^3^(&) T/±(b,y) = nmjv^oo ^i^C&j y)- #ere, P(S^|x, y) 
w defined in the spirit of Rem. \8\and xi = x (b) is used to emphasize that the summands off^ N \b) and 
T^ N \b, y) are non-zero only when the i-th channel input xp in x is compatible with the input symbol of the 
AF-FSMC branch b, denoted by x(b). Note that T 3 (iV) (S) = 2^(6, y) for all b. 
In the case of the lower bound maximization, define 



Tf } (6,y^) 4^Q( x )W(y|x) [ -i- £ £V(Mx,y) 



2N 



tax. 



x e = x(b) 



V *+D 3 _ D 

ye-Dx — y 



n be 



with asymptotic version T^fr, y 



D\ A 



lim 



, (45) 



□ 



Algorithm 19 (Numerical/Stochastic Computation of ff'(b), ff'{b,y), and ff'{b,y^-)) We proceed as 
follows: 

• Generate a sequence of channel input x of length 2N according to the source distribution Q(x). 

• Generate a sequence of channel output y of length 2N in the original channel according to its channel law 
W(y|x). 

• For the generated input and output sequences x and y, compute P(6g|x,y) = P(s^_i, xi, s^|x, y) by 
applying the BCJR algorithm [13] to the AF-FSMC channel. Obviously, P(6^|x, y) will be zero if the £-th 
element of x does not equal x{bg). 

• Whenever l>£ = b and X( = x(b), add the computed P(6^|x, y) to T% N \b). Similarly, whenever hi = b, 



X£ = x(b), and yi = y, add the computed P(S^|x, y) to f^ N \b,y), 



(Of course, in the case of the lower bound optimization, P(6^|x,y), T^ N \b,y), and "AF-FSMC" have to be 



replaced by V(6^|x, y), \b,y— ), and "AB-FSMC", respectively, in the above sentences.) Because of the 



assumptions that we made in Section ITTI the resulting estimates are equal to T 3 (6), T\ n> (b\ y), and T 4 
with probability 1 as N — > oc. 



□ 



V. Optimizing the Information Rate Upper Bound 

Assume that at the current iteration we have found an AF-FSMC model with channel law and the 

corresponding information rate upper bound I(W^'). In order to simplify notation, we will use W instead of 
W®. Let 

* {N \W, W) 4 f N) {W) + ± *£(QW)(y) (P(b|y) ||p(b|y)) (46) 



be the surrogate function for the upper bound / (W) and let its asymptotic version be ty(W, W) = liniTv^oo ^ (W, W). 
In the surrogate function, the conditional pmfs P(b|y) and P(b|y) are induced by the channel laws W and W, 
respectively (cf. Rem. [8]). 

Lemma 20 (Important Properties of We recognize the following properties of the surrogate function: 

1) For any W, the function ^ N \w,W) is never below I^ N '(W). Moreover, for W = W, ^ N \w ,W) 
equals 1 {N) (W). 

2) The function ^ {N) (W, W) can be simplified to 

y {N) (W,W) = c^ N \W) - J>g (w(s\s p ,xj) fi N \b) - £^>g (W(y\b)) f^ N \b,y), (47) 

S b y 

where (W) is independent of W and where (b) and (6, y) were defined in (1401) and (1411 ), 
respectively. Note that s p denotes the previous (or left) state of b = (s p ,x,s). 

3) The function ^ N) (W, W) is convex in W, i.e., in {W(s\s p ,x)} U (iy(y|6)}. 

Proof: See Appendix ITT] □ 



Lemma 21 (Minimizing the Surrogate Function ^) Assume that we are at iteration t and that W = W®. 
The W that minimizes ty(W, W) at the next iteration is given by 

W*(s\s p ,x) oc fi(S) (for all b= (s p ,x,s) G B), (48) 
W*(y\b) oc f 2 (b, y) (for all b £ B and all y G y), (49) 

where the proportionality constants are chosen such that the constraints 

^2w(s\s p ,x) = 1 (for all (s p ,x) £S x X), (50) 

s 

J2w(y\b) = l (forallbeB) (51) 

y 

are fulfilled. 

Proof: See Appendix In] □ 



We observe that the update equations in Lemma |2T1 are not dissimilar to the update equations for the Baum- 
Welch algorithm [23], which was proposed for parameter estimation in hidden Markov models (HMMs)E In 
contrast to the Baum-Welch algorithm, here we are also averaging over y. Note that some simplifications in the 
update equations arise in the case where s p and x determine the next state s. 

With these results in our hand, the proposed iterative optimization of the upper bound is given as follows. 



8 The Baum-Welch algorithm is an early instance of the EM algorithm. The EM theory was later generalized in 1977 by Dempster, 
Laird, and Rubin [15]. 



Algorithm 22 (Iterative Optimization of the Information Rate Upper Bound) 

1) For all b G B and all y G y, set (s\s p , x) and W^\y\b) to some positive initial values. 

2) Set t = 0. 

3) As long as desired or until convergence, repeat the following steps. 

a) Set W = and use Algorithm M to estimate {f^ N \b)} and {f^ N \b, y)}. 

b) Update {W^ t+1 \s\s p ,x)} and {W^ t+1 \y\b)} according to <|48])-(l5D. 

c) Increase t by one. 

d) Go back to step [3aJ 
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□ 



Lemma 23 (Convergence Properties of the Iterative Optimization) For any finite N, all limit points of an 
instance of Algorithm \22\ are stationary points of I N (W). Similarly, for N — > oo, all limit points of an instance 
of Algorithm \22\ are stationary points of I(W). 

Proof: This can be proven by using Wu's convergence results for the EM algorithm [24]. In the case N — > oo, 
the required continuity of ^(W,W) in W and W follows from results in [20], [25]. □ 

Note that in general, I N (W) and I(W) can have local maxima where the algorithm can get stuck. However, 
because of the stochastic evaluation of {2\ (S)} and {T^^S, y)\, these local maxima tend to be instable. 

VI. Optimizing the Difference Function 

Assume that at the current iteration we have found an AF-FSMC with channel law and with corresponding 
difference function value A(W^). In order to simplify notation we will use W instead of W^'. Let 

(W, W) 4 A^) (W) + ± E Q^) W (y l x ) (^(b|x, y) I P(b|x, y) ) (52) 

x y 

be the surrogate function for the difference function A^ N \W) and let its asymptotic version be ^a(W,W) = 
lirn/v->oo *&aP W). In the surrogate function, the conditional pmfs P(b|x, y) and P(b|x, y) are induced by 
the channel laws W and W, respectively (cf. Rem. [8]). 



Lemma 24 (Important Properties of We recognize the following properties of the surrogate function: 
1) For any W, the function ^\w,W) is never below A^ N \W). Moreover, for W = W, ^\w,W) 



equals A( N \W). 
2) The function ^\w,W) can be simplified to 

^(W,W) = c$?(W) - J>g (w(s\s p ,x)) T 3 (iV) (S) - ]T^log (#(#)) T 4 (iV) (S, y) , (53) 



& b y 



where c A (W) is independent of W and where T 3 (6) a«<i T 4 (6, y) were defined in (l43l awe? (l44l . 
respectively. Note that s p denotes the previous (or left) state of b = (s p ,x,s). 



3) The function V£>(W,W) is convex in W, i.e., in {W(s\s p , x)} U {l^(y|6)}. 
Proof: See Appendix |IV| 
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Lemma 25 (Minimizing the Surrogate Function ^a) Assume that we are at iteration t and that W = W®. 
The W that minimizes ^f^(W, W) at the next iteration is given by 

W*(s\s p ,x) oc f 3 (b), (for allb= (s p ,x,s) eB), (54) 
W*(y\b) oc T 4 (S, y), (for all b € B and all y G y), (55) 

where the proportionality constants are chosen such that the constraints 

^2w(s\s p ,x) = 1 (for all (s p ,x) eS x X), (56) 

s 

W(y\b) = 1 (for all b G B) (57) 

y 

are fulfilled. 

Proof: The proof is very similar to the proof of Lemma |2TJ We leave the details to the reader. □ 



Remark 26 (Simplification for Data-Controllable AF-FSMCs) In case the AF-FSMC is a data-controllable 
FSMC, the input sequence x determines the state sequence s and therefore also the branch sequence b. This 
leads to simplifications in the computation of 7/3(6) and T^(b,y). 

In particular, if the AF-FSMC is a partial response channel with memory order M, the state at time index 
I — 1 is S£_i = ~\. With this we obtain 7/3(6) = 1 for all branches b G B and so W*(s\s p ,x) = 1 for all 
6 = (s p , x, s) G B. Moreover, for any 6 G B and y G y we have 



Ub,y)= £ Q(x°_ oo )W(y |x c 



M 7 7 -M+ 



J=6> 2/o=?/ 



with the corresponding formula for W*(y\b). In particular, if the original channel is a partial response channel 
with memory order M then we can simplify this expression even further. Indeed, if M > M then 



f 4 (b,y)= Q(x-m) W{y \x- C - 



Mj 



x_^ ,x°_ ft+l )=a, y =y 



and if M < M then 



f 4 (S,y) = Q(x° M ) Vy(y |x° 



l(x _^,So,xO A+1 )=S, yo=y " 



□ 



With these results in our hands, the proposed iterative optimization of the difference function looks as follows. 



Algorithm 27 (Iterative Optimization of the Difference Function) 

1) For all b € B and all y G y, set (s\s p ,x) and W^\y\b) to some positive initial values. 

2) Set t = 0. 

3) As long as desired or until convergence, repeat the following steps. 

a) Set W = and use Algorithm [13 to estimate {f 3 (A °(6)} and {f[ N \b, y)}. 

b) Update { W^ +1 > (§\s p , x)} and {#< i+1 > (y\b)} according to $54b-$2B- 

c) Increase t by one. 

d) Go back to step |3al 

□ 

Lemma 28 (Convergence Properties of the Iterative Optimization) For any finite N, all limit points of an 
instance of Algorithm [27] are stationary points of A^ N \W). Similarly, for N — > oo, all limit points of an 
instance of Algorithm \27\ are stationary points of A(W). 

Proof: This can be proven by using Wu's convergence results for the EM algorithm [24]. In the case — > oo, 
the required continuity of ^ &(W ,W) in W and W follows from results in [20], [25]. □ 



VII. Optimizing the Information Rate Lower Bound 

Assume that at the current iteration we have found an AB-FSMC model with channel law f)w and the 
corresponding information rate lower bound I_(v^). In order to simplify notation, we will use v instead of 
v®. The derivation of a suitable surrogate function will be somewhat longer than the corresponding derivations 
in the case of the upper bound and the difference function. 

We start by plugging the definition (l2TT>-(|23b of V in terms of v into the information rate lower bound (|30l 
and obtain 

^£ Q ™*)io g ( ^teL_ ) . 



l{ N) (v)+li N) (v), where 



For the following consideration, it will be useful to split this expression into three parts, i.e., (v) = £q (v) + 

li N \v) = ~ ^ E ^( x ) lo § (^( x )) • ( 59 ) 

X 

Z ^ )( ^- + 2^^ g(x) ^ (y|x)1 ° g fe* (§ ' X ' y) ) ' (60) 

x, y \ s / 

^^--^Eww 1 ^ fe^'x'.y) • (6D 

y \s',x' J 



It is clear that Jq (v) is independent of v, therefore, in order to derive a suitable surrogate function }ff( N \v) 
for jS N \v), it is sufficient to focus on I_[ N \v) and I^\v). Our approach will be to derive surrogate functions 
}£_l(v) and ; (v) for Jj («) and (v), respectively, which we will then combine. We define the first 
and second surrogate functions to be, respectively, 

, T ,(A0/~ ~n a UN),~ 1 STr>/ \ixrr I \n f ^( § ' x >y) i)(s,x,y) \ 

& (.,«)- W-^EQ(x)^(y|x) ^ Ey{ . ( ., X|y) Es ,, (a , X!y) j. <«) 



SS^^^l-EE; ?S Tf)(6,y^), (63) 



where 7 is some arbitrary positive real number, where (v) is chosen such that ^_^\v,v) = li N) (v), and 
where T 2 (iV) (S, y-) was defined in d42l 



Lemma 29 (Important Properties of We recognize the following properties of the first surrogate function: 

1) For any v, the function (u, w) is never above I_[ N ^ (v). Moreover, for v = v, }&_^ (v, v) equals I_[ N ^ (v). 

2) The function J^^ (t>, u) can be simplified to 

*™ (v,v) = i N) (v) + J2 E lo g y-)) 2f° & y^) (64) 

where (u) is independent of v and where (S, y— ) was defined in ( I45I ). 

3) The function }&i\v,v) is concave in v, i.e., in {v(5, y— )}. 

Proof: See Appendix IVl □ 



Lemma 30 (Important Properties of _^_ 2 ) W g recognize the following properties of the second surrogate func- 
tion: 

1) For £> = {), the function value (v,v) equals the function value Jg («). 

2) For u = t;, f/ie gradient of^^'(v, v) w.r.t. v equals the gradient of 1$ (v) w.r.t. v. 

3) 77ie function ^_^\v,v) is concave in v, i.e., in {v(b, y— )}. 

Proof: See Appendix IVT1 □ 

Based on the above definitions, we are ready to introduce our surrogate function for the information rate lower 
bound 

^ N \v,v)^c^ N \v)+^[ N \v,v)+^\v,v), (65) 

with its asymptotic version be }&(v } v) = Umjv-i-00 (v,v), where c^ N \v) is chosen such that ^ N \v,v) = 
l( N \v). 



Lemma 31 (Important Properties of \j£) We recognize the following properties of the surrogate function: 



1) For v = v, the function value ^ N \v,v) equals the function value lj N \v). 

2) For v = v, the gradient of ^ N \v,v) w.r.t. v equals the gradient of I} N '(v) w.r.t. v. 

3) The function ^ N \v,v) is concave in v, i.e., in {v(b, y— )}. 

Proof: Follows from Lemmas [29] and [30] □ 



Lemma 32 (Maximizing the Surrogate Function W) Assume that we are at iteration t and that v = The 
v that minimizes }&(y, v) at the next iteration is given by 

v*(b,y^) = ( T 4 ^' y ~M * v(b,y£) (for all b e B and all y^ € yD 1 +D 2 +i^ (66) 
\T 2 {b,yD)J 

Proof: Follows easily from the definition of (v,v), the reformulation of N \v,v) in Lemma [29] and the 
gradient expressions for }£g (v, v) in App. |VI] □ 

With these results in our hand, the proposed iterative optimization of the lower bound looks as follows. 

Algorithm 33 (Iterative Optimization of the Information Rate Lower Bound) 

1) For all b € B and all y— e yDi+D 2 +l ^ set ^(o)^yD^ to some positive initial values. Fix some 7 > 0. (In 
the simulation sections we will talk about suitable strategies for (adaptively) choosing the value of 7.) 

2) Set t = 0. 

3) As long as desired or until convergence, repeat the following steps. 

a) Set v = *)<*> and use Algorithms [17] and E§] to estimate {T 2 (A °(6, y^)} and {T 4 (iV) (6, y^)}. 

b) Update (S, y— )} according to (f66b . 

c) Increase t by one. 

d) Go back to step [3a] 

□ 

Remark 34 (Normalization) Note that no normalization is involved in the update equation (l66l ). However, 
because of the way that v appears in the numerator and denominator of the logarithm in (158T ). the following 
normalization can be applied to v(b\y—) (if desired): all v(b,y—) can be multiplied by some positive constant 
that is independent of b and y— . It is important to notice that this normalization is different from the normalizations 
that were applied in Lemmas [2T] and |25] □ 

Remark 35 (Convergence) Because we cannot prove that our }&} N \v,v) is a never-above surrogate function, 
we cannot make convergence statements like those in Lemmas |23] and |28] □ 

Remark 36 (Simplification for Data-Controllable Auxiliary FSMCs) In data-controllable AB-FSMCs, f 4 (7V) (b, y^) 

has a simplified closed-form and can be pre-computed before optimization iterations. Please refer to Remark l26l 
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for more details. This means that for the optimization of the lower bound for data-controllable channels, one 
only needs to evaluate T 2 (S, y— ) at each iteration. □ 

Remark 37 (Optimization of the Alternative Information Rate Lower Bound) Assuming the additional con- 
straint on v(b,y—) that we imposed in Rem. [T5J we have ^sx^' x 'y) = 1 f° r au y anc ^ so ±2 (^) = 0- 
Therefore, the surrogate function can be chosen to be ^ N \v,v) = c£ N ~)(v) + ^_[ N \v, v), where c^ N \v) is 
chosen such that $^ N \v,v) = I_( N \v) and where ^_[ N \v,v) is defined as in (l62l ). It follows from Lemma [29l 
that & N \v,v) is a never-above surrogate function. We leave the details to the reader to derive an optimization 
algorithm where at each step the value of the lower bound is non-decreasing. 

As already alluded to at the end of Sec. [Till we have just seen that the optimization of the Rem. [15] lower 
bound has nicer analytical properties. However, we remind the reader of our previous remark that usually strictly 
positive choices for D\ and D2 are needed to yield good lower bounds. We leave it as an open problem to find 
suitable modifications of this variant of the lower bound such that the desired analytical optimization properties 
are retained yet less parameters are needed. □ 

VIII. Numerical Results for Partial Response Channels 

In this section, we provide numerical results for the optimization of the upper and the lower bound for (output- 
quantized) partial response channels, which are connected to a binary, independent and uniformly distributed 
(i.u.d) source. In Section I VIII- A I we present the channel model and in Section IVIII-BI we discuss three differ- 
ent initialization methods for starting the iterative optimization methods. Section IVIII-CI describes the Soblex 
optimization method for a later comparison with the proposed techniques. In Section IVIII-DI we discuss how 
the proposed algorithms may provide tight information rate bounds with less computational complexity, when 
compared to computing the information rate in the original channel with a large memory length. Section IVIII-E I and 
the following subsections provide an investigation of the convergence properties of the optimization algorithms. 
Simple channels are used for this purpose to reduce the computational investment. 

A. Source, Channel, and Auxiliary Channel Models 

As mentioned earlier in this section, we assume a binary i.u.d source with alphabet X = {— 1,+1}. The 
original channel is an (output-quantized) memory-lengfh-M partial response channel with input alphabet X, with 
discrete output alphabet y (that will be specified later), and which is described by 



Vt 



/ M \ 
M ^ h mXl-m + ni J . (67) 

\m=0 / 



Here, \i is a quantization function that maps elements of E to y, and {xe}, {ye}, {n{\, {h m } represent the 
channel input process, the channel output process, an additive white Gaussian noise (AWGN) process, and the 
filter coefficients, respectively. We assume that the coefficients h = [ho, hi, ■ ■ ■ , /»m] wq known. Such knowledge 
is not necessary for the optimization procedure in its strict sense, because one only needs to know a realization of 
an input/output process pair ({xe}, {ye}) - However, knowing the channel coefficients helps us start optimizations 
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at more appropriate initial points (see the three initialization algorithms described in Section IVIII-Bb . As usual, 
this channel can be described by an FSMC with |«S| = 2 M states, whereby each state can be labeled by the M 
previous channel inputs and where there are two valid outgoing branches for each state, hence, \B\ = 2 M+l . 

The conditional pmfs W{si\si—\, xg) and W{yt\bg) are then defined in the obvious way. 

The AF-FSMC (and similarly the AB-FSMC) is chosen to have a trellis structure that is the same as the 
trellis structure of an (output-quantized) partial response channel with memory M, with input alphabet X , output 
alphabet y, and the same quantization function \i as for the original channel. In the case of the upper bound and 
difference function, we will optimize over the AF-FSMC parameters {W"(s^|s^_i,x^)} and {W"(y|£>)}, whereas 
in the case of the lower bound, we will optimize over the AB-FSMC parameters \v(b, y— )}. 

B. Initialization Methods 

The parameters { W(se\st-\, x^)} of the AF-FSMC are initialized to the natural settings based on the above AF- 
FSMC trellis definition. For the parameters |V^(y|6)} of the AF-FSMC we consider three different initialization 
methods. 

1) Initialization based on channel coefficient truncation. In the first method, the initial parameters of the AF- 
FSMC are derived from truncating the original channel into its latest M + 1 coefficients h = [ho, ... , hs*]. 
If the memory length M of the AF-FSMC is greater than the original channel memory length M, then we 
fill the remaining M — M coefficients with zeros to obtain h = \ hp, . . . , /ia/>0, ... ,0]. In summary, we 

M+l M—M 

assume an AF-FSMC with channel response 

M 



yi = v 



hmxe- m + ne\ , (68) 



, m=0 



and find its output probability W(y\b). 

2) Initialization based on optimized difference function. In the second method, we select {^(y^)} to minimize 
the difference function. For partial response channels, W(y\b) has a closed form and can be pre-computed 
as discussed in Section IVT1 Remark l26l 

3) Initialization based on averaging. In the third method and for each channel output y € y, we average the 
original channel law W(y\b) over all original FSMC branches and assign it to all the AF-FSMC branches. 
That is, 

W(y\b) = — W(y\b) (for all y e y and all b G B). (69) 

' ' beB 

In all numerical analyses, we use the information rate lower bound as defined in Def. [12] with the setting D\ = 
and Z?2 = (and not the specialized information rate lower bound in Rem. IT3T >. Then, the parameters {v(b, y)} 
of the AB-FSMC are initialized based on the initialization of |V^(s^|s^_i, a^)} and { W^(y|&) }. Namely, for any 
of the above three initialization methods we set v(b,y) = W"(s^|s^_i, xg) ■ W(y\b). (Note that because D\ = 
and Z?2 = 0, we have y— = y.) 



C. Soblex Optimization Algorithm 

In the following subsections, we will compare our proposed optimization algorithms for the upper and the 
lower bound with an improved variation of the simplex algorithm (see [16] for the definition of the standard 
simplex optimization method). Standard optimization algorithms such as Powell or simplex [16] can easily be 
trapped by local minima. Instead, we will use a robust optimization method by combining the standard simplex 
algorithm with an initial sampling of the parameter space with the Sobol quasi-random sequences [16]. This 
method was originally proposed in [17] and was called Soblex optimization. 

Here is a brief explanation about how the Soblex algorithm works. The Soblex optimization is initially given 
a budget, in terms of the number of cost function calls. Within the initial budget, Soblex evaluates the cost 
function using the Sobol sequence and initializes a simplex-shaped subspace, which is constructed from points 
with the lowest costs. The Sobol sequence ensures that we can progressively sample the parameter space in a 
virtually uniform fashion. Unlike most optimization methods that start with a single point in space, the simplex 
algorithm starts from a region of space that can be made arbitrarily close to the global minimum by increasing 
the Soblex budget. Intuitively, if the budget is large enough, the simplex subspace can sufficiently close in on 
the global minimum to allow successful execution of the optimization algorithm. The Soblex algorithm has had 
promising performance results for efficiently finding the global optimum of problems with low dimensionality 
(three-dimensional rotation space in [17]). However, the dimensionality of the AF-FSMC / AB-FSMC parameter 
space is much higher and requires a large Soblex budget. We will use Soblex as a reference to demonstrate the 
superior efficiency and performance of our algorithms, compared to general-purpose optimization methods. 

D. Reduction in the Computational Complexity 

As mentioned in the Introduction and from a practical viewpoint, the capacity and the capacity-achieving input 
distribution of finite-state channels with not too many states (short memory lengths, M) is numerically computable. 
Therefore, for partial response channels with small memory lengths, the proposed optimization techniques may 
not provide an immediate advantage. The true benefit of the optimization techniques is for original partial response 
channels with large memory lengths, where we are able to obtain lower and upper bounds on information rates 
of such channels with less computational complexity. This is further clarified through the following argument. 

For binary signaling in a partial response channel with memory length M, the computational complexity for 
running the "Forward" or the "Backward" part of the BCJR algorithm is proportional to 2 M ■ (2N), which 
exponentially increases in the channel memory length M and is linear in the input/output window length 2N. 
Numerical computation of the information rate only requires the "Forward" part of the BCJR algorithm. Let us 
assume that we use an AF-FSMC / AB-FSMC with memory length M = \M/2~\ to iteratively optimize the 
bounds ([■] is the integer ceiling function) and we use a fixed 7 (such as 7 = 1) throughout the lower bound 
iterations presented in Section IVIII (see (|66"1)). As discussed in Sections [V] and IVIIi for partial response channels 
one only needs to iteratively compute T^(b,y) and T% > (b,y—), which requires one "Forward run" of the 
BCJR algorithm on the input/output sequence to compute the forward messages or a's, one "Backward run" to 
compute the backward messages or /3's, and one final "Forward run" to compute combinations of forward and 
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Fig. 5. Comparison of the relative computational complexity (in logarithmic scale) for numerical evaluation of the original information 
rate and for optimization of the upper and the lower bounds using simpler AF-FSMCs / AB-FSMCs. We have assumed that 3 iterations 
are needed for the optimization of each bound. For original channel memory lengths greater than or equal to M = 10, the optimization 
techniques clearly become computationally advantageous. 

backward messages or it's [13], [26]. 

Therefore, the computational cost of each iteration is 3 -2 M • (2N). For example, the total cost of 3 iterations for 
each upper and lower bound plus one final "Forward run" to actually compute each bound becomes 20-2 A/ • (2N). 
This is shown in Fig.|5] where the original partial response channel memory length M ranges from 1 to 16. The 
vertical axis shows the relative computational complexity in logarithmic scale by assuming the same input/output 
window length. The optimization techniques become computationally advantageous for original channel memory 
lengths greater than or equal to M = 10. 

To verify that the proposed technique can provide tight bounds for partial response channels with large memory 
lengths, we consider the 11-tap original channel with memory length M = 10 studied in [3]. The channel 
coefficients are defined as hi = 1/(1 + (i — 5) 2 ) for < i < 10. In contrast to [3] we also apply a quantization 
function \i that is based on partition points at 

-oo, -2.5(7,,, -2A5a y , -2Aa y , ... , +2Aa y , +2A5a y + 2.5a y , +oo, (70) 

where a y is the output standard deviation. The results are shown in Fig.[6l where the SNR is defined according 
to [3]. The first curve from the bottom shows the lower bound using an AB-FSMC with memory length M = 6 or 
with 64 states. The AB-FSMC is initialized based on the "difference function optimization initialization method" 
(as explained in Remark [26] the AB-FSMC parameters can be pre-computed for such an initialization) and no 
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Fig. 6. Information rate bounds for the quantized version of the memory-10 channel introduced in [3] using AF-FSMCs / AB-FSMCs 
with memory M — 6. It is clear from this figure that at low SNR, even a closed-form and non-iterative optimization of the difference 
function is enough to provide tight bounds at a much lower computational complexity. 

lower bound optimization steps are performed. Similarly, the first curve from the top shows the upper bound 
where the AF-FSMC is initialized based on the "difference function optimization initialization method" and no 
upper bound optimization steps are performed. If we allow two more iterations, we obtain the next set of inner 
curves which slightly tighten the bounds, especially for higher SNRs. For a fair comparison of the computational 
complexities of information rates and their bounds, we used a fixed 7 = 1 in the lower bound iterations (see 
(l66l l) and did not spend extra time to find a better 7 that would potentially result in higher lower bounds. It 
is clear from this figure that at low SNR, even a closed-form and non-iterative optimization of the difference 
function is enough to provide tight bounds at a much lower computational complexity (there are only 64 states 
in the AF-FSMC / AB-FSMC as opposed to 1024 states in the original channel). Even with 2 iterations, the 
complexity of computing bounds is still lower than computing the information rate in the original channel. It is 
also noted that at low SNR, the proposed upper bound is tighter than those studied in [3] using the reduced-state 
upper bound (RSUB) method with 100 states^ 

'it is reminded that the upper bounds in [3], Fig 8, and those given here are not directly comparable, because of the quantized output 
used in our set-up. However, our quantization is fine enough, so that the information rates do not change much even by increasing the 
quantization levels. Therefore, the general conclusion applies. 



E. Original Partial Response Channel with Memory Length M = 2 

In this subsection we study the convergence properties of the upper and lower bound optimizations for a simple 
partial response original channel with memory length M = 2. We also compare tightness of the optimized bounds 
with those obtained via the Soblex numerical optimization described in Section IVIII-CI 

Because this original channel has memory length M = 2, there are three normalized coefficients in (I67T ) given 

as 

[ho, hi, h 2 ] = [0.5774, -0.5774, -0.5774]. (71) 
We refer to this partial response channel as CH3. The quantization function is based on partition points at 

-oo, -2.25, -1.5, -0.75, 0, +0.75, +1.5, +2.25, +oo, (72) 

resulting in 8 quantized outputs and y = {1, . . . , 8}. The SNR is equal to dB as defined in [3]. 

The AF-FSMC / AB-FSMC used for this channel are output-quantized partial response channels with memory 
length M = 1. That is, the AF-FSMC / AB-FSMC channel will have |«S| = 2 states and \B\ = 4 branches. Since 
partial response channels are controllable, optimization over W(s\s p ,x) is not required (current state s p and 
input x immediately determine next state s). Therefore, we only need to optimize over W{y\b) or v(b,y). Since 
there are 4 possible branches and 8 output levels, optimization of W{y\b) is over 32 real-valued and non-negative 
parameters. For the upper bound, J2 y W(y\b) = 1 for all branches and this will further reduce the optimization 
to an optimization over 28 parameters. However, for the lower bound, there are no such normalization constraints 
on v(b,y), cf. Remark l34l 

1) Upper Bound Tests: Fig.|7] shows the effect of the three initialization methods discussed above upon the 
upper bound optimizations. The first 500 iterations of 3000 iterations performed are shown in this figure, where 
the remaining 2500 iterations improved the upper bound only in the order of 10~ 5 bits / channel use. To avoid the 
small fluctuations in stochastic evaluation of T 2 (b, y) in (|4TT > and to solely study the convergence behavior of 
the optimization algorithm, we used a single window of the channel output y with length 2N\ = 5 x 10 5 for all 
iterations. Three main observations are made from this figure. First it is noted that after the first 10 iterations, the 
improvements in for all initialization methods are at most about 7.3 x 10~ 3 bits / channel use. Therefore, for all 
practical purposes, we may stop the optimizations after only a few iterations. Second, for all three initialization 
methods the upper bound optimization algorithm virtually converges to the same number, where the difference 
between the upper bounds in the 3000-th iteration is in the order 10~ 8 bits / channel use. Third, we observe that 
initialization based on the "difference function optimization initialization method" does not necessarily result in 
the lowest upper bound in the first iteration (as can be seen from the figure, the lowest upper bound in the first 
iteration belongs to initialization based on the "averaging initialization method"). 

We also note that all three methods experience a flat period over which the improvement in the upper bound 
is almost negligible. However, this flat period is shorter when the optimization algorithm is initialized by the 
"difference function optimization initialization method" and its convergence is quicker than when the optimization 
algorithm is initialized by the "averaging initialization method". In fact, when using the "averaging initialization 
method", the upper bound improves only by 1.9 x 10~ 5 bits / channel use during the first 174 iterations, whereas 



both the "truncation initialization method" and the "difference function optimization initialization method" vir- 
tually reach the final value within about 125 iterations (the difference between the bounds at iteration 125 and 
3000 is less than 3.8 x 10~ 4 bits / channel use). 

The optimization of the upper bound has a low sensitivity to the numerical technique used for computing 
(&> u) i n (BP - To show this, we performed 100 runs of the optimization procedure using the optimum 
difference method and using 100 different (randomly generated) channel output windows y of lengths 2N\ = 
5 x 10 5 and 2N2 = 2 x 10 6 for computing the upper bound and T 2 (S, y) in (|41~1) . The number of iterations in 
each run was limited to 300. With the relatively small window length 2 N\ , the absolute and normalized standard 
deviation of the optimized upper bound over 100 run^l are 1.2 x 10~ 3 bits / channel use and 2.5 x 10~ 3 , 
respectively, which are negligible for all practical purposes. If we increase the window length by a factor of 4 
to 2N 2 = 2 x 10 6 , the absolute and normalized standard deviation of the optimized upper bound over 100 runs 
reduce to 6.3 x 10 -4 bits / channel use and 1.3 x 10~ 3 , respectively. 

Table U compares our proposed optimization method with the Soblex method explained in Section IVIII-CI 
Even for a very simple AF-FSMC with memory length M = 1 and 8 quantized outputs, the dimensionality of 
the parameter space is 28, which is very large. This makes it a difficult problem for a conventional optimization 
algorithm without any estimate of the initial point. Here, we use the Soblex algorithm, which provides a general 
and reasonable method for finding initial points and performs better than the standard simplex or Powell algorithm. 
In our comparative tests, we used initial budgets of 1000, 2000, and 10000 and ran the Soblex algorithm 100 
different times using 100 non-overlapping initial seeds for the Sobol sequence. The fractional tolerance (frac. 
tol.) [16], which affects the termination condition of the simplex algorithm, is either 10~ 5 or 10~ 6 . In the last 
row of the table, we have shown the performance of 100 stochastic runs of our proposed optimization technique 
(Algorithm [22] in Section |V) with 300 iterations. In all tests, the window length for computing the upper bound 
and f( N \b,y) is 27V~i = 5 x 10 5 . 

The second column in Table U shows the statistical mean of the upper bound over 100 runs. It is observed 
from this column that the mean upper bound decreases as the Soblex initial budget increases or as the fractional 
tolerance decreases. However, the proposed Algorithm [22] attains the minimum mean upper bound. Although 
the difference between the upper bound given by Algorithm [22] and those obtained by Soblex is small, it is 
consistently above the numerical tolerance of Algorithm [22] which was found to be 1.2 x 10~ 3 bits / channel use 
for 2A r i = 5 x 10 5 in the previous paragraph. In the third column of Table H we have used the worst-case upper 
bound in the 300-th iteration of Algorithm [22] over 100 runs as the reference, which is found to be / = 0.4789 
bits / channel use. Then, we have counted the percentage of Soblex runs where the optimized upper bound is 
worse than our worst-case performance. As can be seen from this column, the Soblex method does not perform 
as well as Algorithm [22] in terms of robustness in finding the lowest upper bound. The fourth column of this 
table shows the number of cost calls per run in the Soblex algorithm (computation of information rate bound) 
or in Algorithm [22] (computation of information rate bound and T 2 (b, y)). As discussed in Sections IVIII-DI 

10 The normalized standard deviation is unit-less and is obtained by dividing the standard deviation by the mean of the upper bound 
over 100 runs. 
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Fig. 7. Comparison of three initialization methods for the optimization of the upper bound. The original partial response channel has a 
memory length of M = 2 and the AF-FSMC has a memory length of M = 1, The SNR is equal to dB. Initializing the optimization 
algorithm by the "truncation initialization method" or the "optimized difference function initialization method" leads to faster convergence 
than initializing by the "averaging initialization method". It is noted that after the first 10 iterations, the improvements is at most about 
7.3 x 10 -3 bits / channel use (independently of which initialization method is used). 



computing {b, y) plus a new upper bound in each iteration is four times more complex than computing the 
upper bound only. Hence, we have 300 x 4 in the last column. Finally, the actual average time spent in each 
run is shown in the last column. It is observed that with the initial budget of 10000, Soblex has 10 — 11 times 
higher computational complexity than Algorithm [22] It is noted that the real advantage of Algorithm [22] is for 
more complicated AF-FSMCs with larger memory lengths and output levels, where the Soblex method becomes 
less reliable and computationally more challenging. 

After finding the optimum upper bound in one of the stochastic runs of Algorithm [22] with 300 iterations 
and window length of 2N\ = 5 x 10 5 , we passed the optimized AF-FSMC parameters to the standard simplex 
algorithm (with no initial budget to randomly sample space) to see if simplex method can further improve 



TABLE I 34 
Comparison of the Soblex optimization Algorithm [17] with the proposed Algorithm[22]in SectionIvI 
Algorithm[22]is superior to Soblex method both in terms of reliably finding the lowest upper bound and 

computational efficiency. 



Algorithm Specifications 


Mean upper bound, 


Percentage of runs with 


Number of cost calls / run 


Average time 




bits / channel use 


a bound above 0.4797 b/ch use 


or iteration complexity 


per run, sec. 


Soblex budget= 1000, frac. tol. = 10 -5 


0.4791 


53% 


2180 


218 


Soblex budget= 2000, frac. tol. = 10~ 5 


0.4789 


37% 


3244 


325 


Soblex budget= 10000, frac. tol. = 10 -5 


0.4788 


32% 


1111 


1117 


Soblex budget= 10000, frac. tol. = 10~ 6 


0.4783 


6% 


12741 


1278 


Algorithm |22] Section [V] 


0.4764 


0% 


1200(300 x 4) 


110 



the upper bound. The fractional tolerance for the simplex algorithm was set to 10 . It turns out that simplex 
algorithm can improve the upper bound found by Algorithm [22] by only 6.4 x 10~ 5 bits / channel use, which is 
negligible for all practical purposes. 

2) Lower Bound Tests: Fig.[8] shows the effect of the three initialization methods discussed in Section |VIII-BI 
on the lower bound optimization. The first 10 iterations of 3000 iterations performed are shown in this figure, 
where the remaining 2990 iterations improved the lower bound only by a maximum of 2.5 x 10~ 6 bits / channel 
use. To avoid the small fluctuations in stochastic evaluation of (b, y)0 used in (l66l ) and to solely study the 
convergence behavior of the optimization algorithm, we used a single window of channel output y with length 
2N\ = 5 x 10 5 for all iterations. All other parameters are the same as for the upper bound tests. At each iteration, 
the parameter 7 in (l66l ) was varied in the set V = {1,2,..., 10}. A small 7 corresponds to more freedom in 
updating v with respect to v, whereas a large 7 corresponds to more conservative update of v relative to v. In 
the limit as 7 — > 00, v = v. Our simple algorithm for choosing 7 is summarized as follows. 

Algorithm 38 (An algorithm for choosing 7 in (l66l) ) 

1) Choose a finite set for varying 7 such as V. 

2) At iteration t, update v according to each individual 7 in the set T using (l66l ) and compute the corresponding 
information rate lower bound. 

3) For the next iteration, choose the 7 and its corresponding v that results in the highest lower bound in the 
set considered. 

4) Increase t by one. Compute (5, y) (and T^ N \b,y) if needed) and go back to step 1. 

□ 

Obviously, the larger the size of the set F is the more time we spend in optimizing the lower bound in each 
iteration. So the practical size of T is chosen by taking this factor into account. However, we do not concern 

"Note that because we use Di = and D2 = in lower bound optimizations, we have y— = y in T 2 . 



ourselves with this issue here, because the purpose of the following analysis is understanding the convergence 
behavior of the algorithm and also its comparison with other local optimization methods such as Soblex. 

Two main observations are made from Fig.[8] First, the lower bound optimization has a fast-converging behavior. 
While optimization with initialization based on the "averaging initialization method" yields the worst initial lower 
bound, viz. / = 0, it converges very quickly and within 3 to 4 iterations. On the other hand, optimization with 
initialization based on the "optimized difference function initialization method" results in the highest lower bound 
in the first iteration and virtually converges within 2 to 3 iterations. Second, for all three initialization methods 
the optimization algorithm converges virtually to the same lower bound, where the difference between the lower 
bounds for the 3000-th iteration is in the order 10~ n bits / channel use. 

The optimization of the lower bound has a low sensitivity to numerical techniques used for computing 
f 2 (b, y) in (l42l ). To show this, we performed 100 runs of the optimization procedure with initialization 
based on the "optimized difference function initialization method" and using 100 different (randomly generated) 
channel output windows y of lengths 2Ni = 5 x 10 5 and 2N2 = 2 x 10 6 for computing the lower bound and 
Tjj j (S, y). The number of iterations in each run was limited to 50, due to the fast-converging behavior of the 
lower bound optimizations. We used T = {1, 2, . . . , 10} in Algorithm l38l With a relatively small window length 
2N\, the absolute and normalized standard deviation of the optimized lower bound over 100 runs are 1.0 x 10 -3 
bits / channel use and 3.4 x 10~ 3 , respectively, which are negligible for all practical purposes. If we increase 
the window length by a factor of 4 to 2N2 = 2 x 10 6 , the absolute and normalized standard deviation of the 
optimized lower bound over 100 runs reduce to 5.7 x 10~ 4 bits / channel use and 1.9 x 10~ 3 , respectively. 

Table [TT] compares our proposed optimization method with Soblex method explained in Section IVIII-CI Even 
for a very simple AB-FSMC with memory length M = 1 and 8 quantized outputs, the dimensionality of the 
unconstrained parameter space is 32, which is very large. This makes it a difficult problem for a conventional 
optimization algorithm without any estimate of the initial point. Here, we use the Soblex algorithm, which 
provides a general and reasonable method for finding initial points and performs better than the standard simplex 
or Powell. In our comparative tests, we used initial budgets of 1000, 2000, 5000, 20000, and 50000 and ran 
the Soblex algorithm 100 different times using 100 non-overlapping initial seeds for the Sobol sequence. The 
fractional tolerance (frac. tol.) [16] is 10~ 4 . In the last row of the table, we have shown the performance of 100 
stochastic runs of our proposed optimization technique (Algorithm [33] in Section IVIII ) with 50 iterations. In all 
tests, the window length of y for computing the upper bound and T 2 (b, y) is 2N\ = 5 x 10 5 . 

The second column in Table [n] shows the statistical mean of the lower bound over 100 runs. It is observed 
from this column that the mean lower bound increases as the number of initial budget increases. However, the 
proposed Algorithm [33] attains the maximum mean lower bound by a large margin, which is beyond the numerical 
tolerance of the algorithm. In the third column of Table [ill we have used the worst-case lower bound in the 50-th 
iteration of Algorithm [33] over 100 runs as the reference. The worst-case lower bound is I = 0.2970 bits / channel 
use. Then, we have counted the percentage of Soblex runs where the optimized lower bound is above (or better 
than) our worst-case performance. As can be seen from this column, Algorithm [33] is noticeably superior in 
terms of robustness in finding the highest lower bound. The fourth column of this table shows the number of 
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Fig. 8. Comparison of three initialization methods for the optimization of the lower bound. The original partial response channel has a 
memory length of M = 2 and the AB-FSMC has a memory length of M = 1. The SNR is equal to dB. Optimization with initialization 
based on the "optimized difference function initialization method" has the highest lower bound value in the first iteration. For all three 
initialization methods, the optimization algorithm virtually converges to the final lower bound value within 3 to 4 iterations. 



cost calls per run in the Soblex algorithm (computation of information rate lower bound) or in Algorithm 133] 
(computation of information rate lower bound and T^ib, y)). We used T = {1, 2, . . . , 10} in Algorithm 138] for 
choosing 7. Therefore, at each iteration of our algorithm, we computed T 2 (p, y) once plus ten lower bounds. 
According to the discussions in Section |VIII-DI this is 13 times more complex than computing the lower bound 
only. Finally, the actual average time spent in each run is shown in the last column. For example, with the initial 
budget of 20000, Soblex is about 37 times more computationally complex than Algorithm [33] The real advantage 
of Algorithm [33] is, however, for more complicated AB-FSMC with larger memory lengths and output levels, 
where the Soblex method becomes less reliable and computationally more challenging. 
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Comparison of the Soblex optimization algorithm [17] with the proposed Algorithm[33]in Sectiqn IviTI 
Algorithm[33]is superior to Soblex method both in terms of reliably finding the highest lower bound and 

computational efficiency. 



Algorithm Specifications 


Mean lower bound, 


Percentage of runs with 


Number of cost calls / run 


Average time 




bits / channel use 


a bound above 0.2970 b/ch use 


or iteration complexity 


per run, sec. 


Soblex budget= 1000 


0.2803 


3% 


5189 


875 


Soblex budget= 2000 


0.2826 


3% 


6149 


1037 


Soblex budget= 5000 


0.2845 


5% 


9127 


1539 


Soblex budget= 20000 


0.2848 


9% 


23916 


4032 


Soblex budget= 50000 


0.2883 


11% 


53939 


9096 


Algorithm |33l Section NTH 


0.3000 


100% 


650(50 x 13) 


109 



After finding the best lower bound in one of the stochastic runs of Algorithm [33] with 50 iterations and window 
length of 2Ni = 5 x 10 5 , we passed the optimized AB-FSMC parameters to the standard Powell algorithm to 
see if it can further improve the lower bound. The fractional tolerance for Powell algorithm was set to 10~ 4 . 
We observed that Powell algorithm can improve the lower bound found by Algorithm [33] by only 2.8 x 10 -5 
bits / channel use, which is negligible for all practical purposes. 

F. Original Partial Response Channel with Memory Length M = 3 

In this subsection we study the convergence properties of the upper and lower bound optimizations for a simple 
partial response original channel with memory length M = 3. Because the original channel has memory length 
M = 3, there are four normalized coefficients in (l67l) given as 

[huMMM = [0.5,0.5,-0.5,-0.5]. (73) 

This partial response channel is referred to as EPR4 in the literature [3]. The quantization function is based on 
partition points at 

-oo, -2.5, -2, -1.5, -1, -0.5, 0, +0.5, +1, +1.5, +2, +2.5, +oo, (74) 

resulting in 12 quantized outputs and y = {1, . . . , 12}. The SNR is equal to dB. 

The AF-FSMC / AB-FSMC chosen for this channel are partial response channels with memory length M = 2 
with |<S| = 4 states and \B\ = 8 branches. 

1 ) Upper Bound Tests: We studied the effect of three initialization methods discussed in Section I VIII-B I on 
the upper bound optimizations. We observed that the convergence behavior of the upper bound in the EPR4 
channel is similar to that of CH3 in Fig.[7] In particular, very fast convergence behavior with initialization 
based on the "truncation initialization method" or the "optimized difference function initialization method" is 
observed, where the optimization algorithm virtually converges within the first 10 iterations. For example, when 



the "optimized difference function initialization method" is used, the upper bound minimization algorithm starts 
at 0.4536 bits / channel use and reaches 0.4377 bits / channel use at the 10-th iteration and does not decrease 
noticeably after that. On the other hand, when the "averaging initialization method" is used, the convergence 
of the minimization algorithm is poor and it is not recommended for optimizations. For brevity purposes, these 
observations are not shown in a separate figure. 

2) Lower Bound Tests: Fig.[9] shows the lower bound optimizations for the EPR4 channel for 300 iterations, 
where initialization is based on the "optimized difference function initialization method". As can be seen 
from this figure, after only 2 — 3 iterations the lower bound reaches its maximum. For this test, we used 
r = {1,1.5,2,2.5,... ,9.5,10,100} in Algorithm 1381 for choosing 7 at each iteration, where 7 = 100 has a 
stabilizing effect for v. In our analysis, we observed that lower values of 7 result in drastically decreasing lower 
bounds over some iterations. In such cases, a large 7 prevents this situation by choosing a v that is very close 
to the previous v. As an example, if we fix 7 = 1 for all iterations, we observe that the lower bound would 
decrease/increase over iterations in oscillatory manner and the maximum lower bound obtained is slightly below 
the lower bound with variable 7 (the difference is 7.2 x 10~ 4 bits / channel use). Therefore, using a variable 7 
is beneficial for both stabilizing and potentially providing higher lower bounds. 

IX. Numerical Results for Fading Channels 

In this section, we provide numerical results for the optimization of the upper bound, the lower bound, and 
the difference function for fading channels. 

A. Source, Channel, and Auxiliary Channel Models 

Throughout this section, we assume that the source is characterized by i.u.d. binary constant power signaling 
(BPSK). The original channel is an (output-quantized) correlated and flat-fading channel (FFC), also known as 
frequency non-selective fading channel, and defined as 

ye = fj.(gixi + n e ) . (75) 

Here, fi is a quantization function that maps elements of R to some discrete set y, {nf\ is a complex- valued 
AWGN process with variance per dimension Nq/2 and independent real and imaginary components, and gg = 
age 1 ® 1 is the complex-valued FFC gain, with the FFC amplitude ag and the FFC phase 0g. The fading power is 
normalized to a 2 g = 0.5 per complex dimension. The average power of the channel input is £ s . Since the fading 
power is normalized, the average SNR per symbol is £ s /Nq. 

The actual realization of the time-varying FFC gain gg is unknown to the receiver and to the transmitter a 
priori. It is assumed, however, that the statistics of the time-varying FFC gain gi are known and do not change 
over time. Hence, the fading process is stationary. We assume that the fading process is a Gauss-Markov process 
as follows 



9l = age-i + wg. 
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Fig. 9. Optimization of the lower bound for the EPR4 channel with 300 iterations at an SNR equal to dB. The initialization is based 
on the "optimized difference function initialization method". We used V — {1, 1.5, 2, 2.5, . . . , 9.5, 10, 100} in Algorithm 1381 for choosing 
7 at each iteration, where 7 = 100 has a stabilizing effect for v. After only a few iterations the lower bound reaches its maximum. 



In this model, the FFC gain is complex-valued Gaussian distributed with Rayleigh-distributed FFC channel 
amplitude a£ and uniformly distributed FFC channel phase 9g. In d76T ), is the complex-valued and white Gaus- 
sian process noise. The Gauss-Markovian assumption is not absolutely necessary for optimizing the information 
bounds, especially the lower bound. However, it will facilitate obtaining tight lower bounds for H (= H(Y\X)) 
for the original fading channel, which is required for the computation of the upper bound (see d29l ). Section UTTb. a 
in (l76l ) determines the correlation of the fading channel gain at two successive time indices. We assume that this 
correlation coefficient is given as a = £'{G ! ^G ! |_ 1 }/2(Tg = Jo(27t/dT), which is the first correlation coefficient 
in Clarke's model [1], [27]. Jo lS tne zero-order Bessel function of the first kind, /d is the Doppler frequency 
shift, T is the transmitted symbol period, and the superscript * denotes complex conjugate. Moreover, according 
to Clarke's model, we note that the real and imaginary parts of the fading process {g{\ and the process noise 
{wi} are independent of each other. 

Direct evaluation of information rates in correlated fading channels with no channel state information is still 
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an open problem. Therefore, having access to tight upper and lower bound for such channels is valuable. It is 
noted that fading channels are not naturally finite-state channels. Nevertheless, the technique first proposed in [11] 
(see also [3]) is still able to provide upper and lower bounds for non-finite state fading channels using auxiliary 
FSMCs. The numerical results in this section will show that using the optimization techniques introduced in this 
paper, we are able to noticeably tighten these bounds on information rates of fading channels. 

The AF-FSMC (and similarly the AB-FSMC) is chosen to have a trellis structure with data-independent 
transitions (similar to the trellis that describes the Gilbert-Elliott channel, cf. Example [5]), with input alphabet X, 
output alphabet y, and the same quantization function \i as for the original channel. In particular, the number of 
states will be K a -Kg, where K a and Kg represent the number of auxiliary channel fading amplitude quantization 
intervals and auxiliary channel phase quantization intervals, respectively. Moreover, the transition between any 
two states is possible. In our experiments, we will look at the effect of using different values of K a and Kg for 
the AF-FSMC / AB-FSMC. 

B. Initialization Methods 

For fading channels, we mainly use two methods for initializing the AF-FSMC parameters. 

1) Natural initialization method (based on FSMC modeling of the fading channel amplitude and phase). FSMC 
modeling of fading channel amplitude and phase has been used in the literature for a variety of purposes, 
including receiver design, see [28], [29] and the references therein for more details. In this initialization 
method, we model the fading channel amplitude ae with K a states and the fading channel phase with 
Kg states and assume a first-order Markov transition between the states. This will result in a non-data- 
controllable AF-FSMC model with K a ■ Kg states, where the initial AF-FSMC state transition probability 
W(s\s p ,x) and channel law W(y\b) are readily computable. 

2) Initialization based on optimized difference function. In this method, we select an initial that 
(locally) minimizes the difference function. Unlike the case of partial response channels, there is no closed- 
form expression for the minimum of the difference function. Therefore, we use Algorithm [27] to iteratively 
minimize the difference function, which in turn is initialized by the "natural initialization method". 

Similarly to Section [VIIIL in all numerical analyses, we use the information rate lower bound as defined in Def.[T2l 
with the setting D\ = and D2 = (and not the specialized information rate lower bound in Rem.[T3T). Moreover, 
the parameters {v(b, y)} of the AB-FSMC are initialized based on the initialization of ^W(si\§£-i,xi)} and 
{W(y\b)}. Namely, for any of the above two initialization methods we set v(b,y) = W(si\se-i, xi) ■ W(y\b). 

C. A Lower Bound on H 

As mentioned in d29l ) in Section [Till computation of the upper bound requires knowledge of the conditional 
entropy rate 




(77) 
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for the original channel, which cannot be obtained using any auxiliary channel. (A similar statement holds for 
the computation of the difference function.) For partial response channels, this conditional entropy was readily 
computable for the original channel, as it was the conditional entropy with perfect CSI and only depended on 
the distribution of the AWGN. However, in fading channels, H is not easily computable. It should be mentioned 
that H has a closed-form expression in the case that the above channel model is such that the receiver does not 
use output-quantization and such that the sender uses constant power signaling, e.g., BPSK [30]. However, this 
expression is not readily applicable here, because we need to quantize the channel output yz for the optimization 
of the bounds 

In this subsection, we provide a lower bound on H, which is applicable for any quantized channel output and 
can be made as tight as required (at the expense of some computational complexity). Referring to [21, pp. 63-65], 
we note that, using stationarity of the involved processes, the conditional entropy rate can also be written as 

H = lim ff(y |Ylk +lJ X^ +1 ). (78) 

iv — >oo 

Using the fact that there is no feedback from the receiver to the transmitter and the fact that, given the channel 
input up to time index I, the channel output up to time index t is independent of the channel input from time 
£ + 1 on, this can be written as 

H = lim tf(lo|YZ^ +1 ,X° N+1 ). (79) 

N—>oo 

Using the fact that conditioning cannot increase entropy, we can obtain lower bounds on H. A good candidate 
for conditioning is a past fading channel gain such as G-d for D > 1. To see this, let us elaborate on 

HiYolYzlj+vX-N+i) as fo U° ws 

i/CTolYZ^X ^) > H(Y \YZ 1 N+V X°_ N+1 ,G- D ) (80) 

= H(Y \YZ 1 D+1 ,X - D+ i,G- D ) (81) 

= #(Y D |Yf-\X?,G ). (82) 



In ([811 ) we have used the fact that Yq given G-d is independent of all channel inputs and outputs up to and 
including time index — D, i.e., independent of Y~j^ +1 and X~^ +1 . Finally, the last equality follows form the 
channel being stationary. It can also be shown that by increasing the delay D in the above, tighter lower bounds 
on H will be obtained. 

As an example of how the above lower bound may be computed, let us consider the case where the conditioning 
delay is D = 2. Then, we need to find 

H(Y 2 \Y 1 , Xl G ) = H(Yl\Xl G ) - H(Y 1 \X 2 1 , G ) (83) 
= i?(Y?|X?,Go)- J H"(Yi|X 1 ,G ), (84) 

12 The closed-form expression of H given in [30] may provide an estimate of the conditional entropy for very finely quantized channel 
output and using the relation between discrete entropy and differential entropy given in [21, pp. 228-229]. 
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which requires knowledge about P(yf|xf,<;o) an d P(yi\x\, go). These quantities can be computed by integrating 
over the missing channel gains. Let us first consider P(yi\x\, go) which can be written as 

P(yi\xi,go) = J P(yi,gi\x 1 ,g )dg 1 (85) 

= J P(yi\xi,gi) f(gi\go)dgi, (86) 

where we have used the Gauss-Markovian property of the fading process and the fact that according to (|75l ), yc 
given xe and gn is independent of all other previous channel gains. In (|86*1 ). Py x \x x ,G^\ x \i 9\) i s tne area under 
the AWGN distribution between the two quantization points y\_ x and y^ with a mean of x\g\ and variance of 
Nq/2 per complex dimension. Also, f(gi\go) is the probability density function of gi given go, which according 
to (1761 ) is Gaussian with mean ago and variance a 2 g — 0" 2 a 2 = 0.5(1 — a 2 ) per complex dimension. 
Similarly, P(y 2 |x 2 , go) is expanded as 

^(yi|xi,5o) = J J P(yi,92,gi\^i,go)dg 2 dgi 

= j P(yi\gi,xi) f{gi\go) (^J P(y2\g2,x 2 ) f(g 2 \gi)dg<^\ dgi. 

In principle, computations of the above integrals should be carried out in the complex domain, because both 
channel output yi and channel gain gg are complex-valued. However, according to the independence property 
of the real and the imaginary parts of the AWGN and the fading process, one can simplify computations 
greatly by only evaluating the probabilities either for the real or for the imaginary dimension, computing the 
corresponding entropies, and multiplying the obtained result by a factor of 2. Evaluation of the lower bound can 
be similarly extended to larger conditioning delays D, which result in tighter lower bounds, but at the cost of 
more computational complexity. 

We have computed lower bounds on H using delays from D = 1 to D = 3 with a reasonable complexity. 
Fig. [10] shows the lower bound for conditioning delays from D = 1 to D = 3 for the Gauss-Markov fading channel 
with a = Jo(2-7t/dT) and the normalized Doppler frequency shift /dT = 0.1. The horizontal axis shows the 
average SNR. Each real and imaginary component of the channel output yi is quantized into 10 intervals using 
a quantization function that is based on partition points at 

— oo, — 2.0(7^, —1.5(7^, — 1.0a y , —0.5(jy, 0, +0.5a y , +l.0a y , +1.5a y , +2.0a y , +oo (87) 

where Uy is the average output variance per dimension and defined as cj 2 , = &g£ s + No/2. We note that for 
high SNR conditions, the entropy rate H visually reaches its limit with the conditioning delay of D = 3. In the 
following subsections, we will see that even with such modest conditioning delays, we can obtain good upper 
bounds for fading channels. 



D. Optimizing the Difference Function 

Fig. [TT] shows the iterative optimization of the difference function. The original fading channel has a normalized 
Doppler frequency shift of foT = 0.1 and the average SNR is dB. The window length of the channel output y 
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2 4 6 8 10 12 14 16 18 20 

Average SNR, dB 

Fig. 10. Lower bounds on the conditional entropy rate H — H(Y\X) in the original Gauss-Markov fading channel with various 
conditioning delays from D = lto.D = 3asa function of the average SNR £ s /Nq. The Gauss-Markov fading channel has a — 
J (27t/dT) and the normalized Doppler frequency shift is fr>T — 0.1. 

for optimizations is 2N = 1 x 10 6 . The number of states in the AF-FSMC model is set to 16 states. Initialization 
is done according to the "natural initialization method" using three different mappings of the fading channel 
amplitude and phase into the FSMC model. The difference function is computed using the lower bound for H 
with D = 3 as discussed above. Therefore, the values shown are, in fact, upper bounds on the difference function 
values. The figure only shows the first 500 iterations of a total of 3000 iterations, because the improvement 
in the difference function after the first 500 iterations was less than 1.9 x 10 -3 bits / channel use. Although 
the optimization algorithm eventually converge to the same difference value (independently of the initialization 
method), the optimization run with an AF-FSMC with more phase states {Kq = 8) and less amplitude states 
(K a = 2) converges faster. Also note that after a few iterations, the value of the difference function for this SNR 
is, at most, only about 0.06 bits / channel use, which shows the effectiveness of the optimization technique. 
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Fig. 11. Comparison of the difference function optimization algorithm with initialization by the "natural initialization method". The 
original fading channel has a normalized Doppler frequency shift of foT = 0.1 and the average SNR is dB. The AF-FSMC model has 
16 states. The "natural initialization method" is applied with three different mappings of the fading channel amplitude and phase into the 
AF-FSMC model. 



E. Upper Bound Tests 

Fig.[[2]shows 200 iterations for the optimizing of the upper bound. The original fading channel has a normalized 
Doppler frequency shift of /rjT = 0.1 and the average SNR is 16 dB. We have used the "natural initialization 
method" with Kg = 8 phase states K a = 2 amplitude states. It is observed from this figure that if enough number 
of iterations are allowed, one is able to optimize the upper bound well below the CSI upper bound (the new 
upper bound is 0.8643 bits / channel use, whereas the CSI upper bound is 0.9736 bits / channel use.) 

Furthermore, we observed that initializing the upper bound optimization algorithm by the "optimized difference 
function initialization method" may provide higher initial upper bounds that do not improve with further iterations. 
This may be due to the fact that this initialization method yields parameters where the upper bound has a local 
minimum, whereas the "natural initialization method" yields parameters at a non-stationary point of the upper 
bound, which can subsequently be improved iteratively. So for the upper bound, the "natural initialization method" 
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Fig. 12. Convergence of the upper bound optimizations using the "natural initialization method" of the Gauss-Markov fading channel 
into Kg — 8 fading phase states and K a = 2 fading amplitude states. The average SNR is 16 dB. From first to last iteration, one is able 
to reduce the upper bound significantly (by 0.36 bits / channel use), well below the CSI upper bound, if enough number of iterations are 
allowed. 



with about 100 to 200 iterations is recommended. 



F. Lower Bound Tests 

Table [III] shows examples of obtained lower bounds based on a variety of initialization methods. Namely, the 
"natural initialization methods" and the "optimized difference function initialization method" are used with four 
and five, respectively, different settings of amplitude and phase states. 

The original Gauss-Markov fading channel has a normalized Doppler frequency shift of fr>T = 0.1 and 
the average SNR is dB. The number of states in the AB-FSMC model is set to 16 states. We used T = 
{1, 1.5, 2, 2.5, . . . , 9.5, 10, 100} in Algorithm 1381 for choosing 7 at each iteration, where 7 = 100 has a stabilizing 
effect for v and does not let the lower bound decrease over iterations. It is clear from this table that initialization 
based on the "optimized difference function initialization method" can result in a higher information rate lower 
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The effect of different initialization methods on the lower bound. Better lower bounds are often obtained if 
the "optimized difference function initialization method" is used, which was observed to be a typical behavior 

of the lower bound across snr. 



Initialization Method 


Best Lower Bound Found, bits / channel use 


Natural, Kg = 16, K a = 1 


0.3357 


Natural, Kg = 8, K a = 2 


0.3501 


Natural, Kg = 4, K a = 4 


0.3345 


Natural, Kg = 2, K a = 8 


0.3056 


50 Iterations for Optimized Diff, Kg = 16, K a = 1 


0.3562 


50 Iterations for Optimized Diff, Kg = 8, K a = 2 


0.3580* 


3000 Iterations for Optimized Diff, K s = 8, K a = 2 


0.3608" 


50 Iterations for Optimized Diff, Kg = 4, K a = 4 


0.3539 


50 Iterations for Optimized Diff, Kg = 2, K a = 8 


0.3529 



bound. For this initialization method we also observe that increasing the number of iterations from 50 to 3000 
yields an improvement of only 0.0028 bits / channel use, which is negligible for all practical purposes. 

Finally, we would like to emphasize the importance of optimizing the lower bound from the mismatched 
decoding perspective. FSMC models have been proposed in the literature [3 1]— [36] for channel estimation and 
decoding in time-varying, continuous-valued, and continuous state-space fading channels (see [29] for a review 
of FSMC-based decoding in fading channels). However, the fact that such decoding is mismatched to the physical 
fading channel often goes unnoticed. In particular, the choice of FSMC parameters (type of FSMC, number of 
amplitude and phase states, the FSMC state transition probabilities, and the FSMC output probabilities) are often 
chosen on an ad-hoc basis and mostly using the natural mapping of fading channel amplitude and phase into 
FSMC models. On the other hand, the results in this section showed that by optimizing the AB-FSMC parameters, 
we can achieve higher lower bounds on mismatched decoding rates and therefore achieve higher information rates 
in AB-FSMC-based decoding for fading channels. 

G. Optimized Bounds Versus SNR 

Fig. [13] shows optimized upper and lower bounds for a practical range of SNRs and for the considered Gauss- 
Markov fading channel as in previous figures with a relatively fast fading rate of /qT = 0.1. We used T = 
{1,1.5,2,2.5,... ,9.5,10,100} in Algorithm 1381 for choosing 7 in the lower bound optimizations. The upper 
bound is considerably tighter than the CSI upper bound and together with the lower bound, can successfully 
bound the range of possible information rates. To the best of authors' knowledge, no such good bounds have 
previously been shown in the literature. 




Fig. 13. Optimized upper and lower bounds for a practical range of SNR and for the Gauss-Markov fading channel with the normalized 
fading rate of foT = 0.1. The upper bound is considerably tighter than the CSI upper bound and together with the lower bound, can 
successfully limit the range of possible information rates. 

X. Conclusions 

In this paper, we devised iterative algorithms for the minimization of an information rate upper bound and 
the maximization of an information rate lower bound for communication channels with memory. Moreover, we 
also discussed the minimization of the so-called difference function which represents the difference between the 
upper bound and a specialized version of the lower bound. 

Our proposed optimization techniques are EM-type algorithms and optimize auxiliary FSMC models under the 
constraint that the (time-invariant) trellis is fixed. We applied the optimization techniques to original channels that 
are finite state (such as partial response channels) and non-finite state (such as fading channels). In both cases, 
we observed that the optimization techniques considerably tighten the bounds with a reasonable computational 
complexity. This is particularly important for finite-state channels with a large memory length (M > 10), where 
the numerical computation of the information rate is complex, or for non-finite channels, where direct or numerical 
evaluation of the information rate is not possible. Using the proposed techniques, we improved the upper bound 
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of the considered fading channel by as much as 0.11 bits / channel use below the CSI upper bound in binary 
signaling, which together with the optimized lower bound, successfully provided bounds for the fading channel 
information rates. Optimizing the lower bound is also significant from a mismatched decoding viewpoint for 
receivers that are equipped with the decoding metric for the auxiliary FSMC model, which is mismatched to the 
original communication channel. 



Appendix I 
Proof of the Statement in Rem. [H] 

In this appendix we verify that ]Csx^(^> x ;y) = 1 if Y^$ e ^(^-i> x ii Y^dI) = 1 f° r ai l S£-it au x i, an d 
all y£g. Indeed, 
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Appendix II 
Proof of Lemma I 

For ease of reference, let us repeat here the relevant surrogate function: 

¥ N \w,w)±i (N \w) + ±Y,(Q w )(y) D i> {p$\y) \\ p (h\y)) . 



A. Property 1 

This follows from standard KL divergence properties. 
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B. Property 2 

Using the definition of 1 {N) (W) in we obtain 



= ^^Q(x)^(y|x)log (W(y|x)) 

- 2^ £(W(y) E lo s (( W(y)^ 

y b 

- ^ E( W(y) E p fiw l °s (HUy)) ■ m 

y 6 

We note that the first and third terms on the right-hand side are only functions of Q, W, W, and are independent 
of W. By combining these two terms as c[ N ^ (W) and by combining the second and fourth terms, we can write 

¥ N) (W,W) = cf\w) - A_^(g^)(y)^P(b|y)log (P(b,y) 



2N 

Now using the AF-FSMC's decomposition property ([T3T ) for P(b, y) = P(x, s, y), we can simplify the surrogate 
function further to 

^ N \w,W) = c[ N \w) 



- ^ E( W(y) E ^ly) lo s WW) 

y 6 

-^E(W(y)E^( fi iy) E lo § 

y b teXw 

y b eel N 

where we notice that the second term on the right-hand side is independent of W. Therefore, by combining 
Ci (W) and this term into (W) and after a few manipulations of the third and fourth terms, we obtain 

¥ N) (W,W) = cW(W) - J>g (W(s\s p ,x)) fi N \b) - £5>g (W(y\b)) f^ N) (b,y), (96) 

i h v 

where f[ N \b) and T 2 (A °(6,y) were defined in d40b and (RTl) . respectively. 
C. Property 3 

In order to prove convexity of VP (W", W) w.r.t. {W^(s|s p , x)} U we need to show that the 

corresponding Hessian matrix is positive semi-definite. This can be done as follows. First, we use the expression 
in d96l ) for VP (W", W) to obtain the following second-order partial derivatives: 

d 2 _ y(N) = fj N \b) - {N) = rf >(£) ^ (97) 



dW(s\s p ,x) (W(s\s p ,x)y dW(y\b) {W{y\b)f 

-¥ N) = 0, s °\ , * W = 0, ^ . . * (JV) = 0. (98) 



5W(s|s p ,a;)9W(s'|s' a/) SW(y \b)dW(y'\ V) dW(s\s p ,x)dW(y'\V) 



Secondly, we note that (5) and (b, y) are non-negative, cf. (l40l) and (HTI ). Thirdly, we combine these 
results and see that the Hessian matrix is diagonal with non-negative diagonal elements, i.e., the Hessian matrix 
is indeed positive semi-definite. 

Let us remark that in the above computations of derivatives we did not impose the constraint that {V^(s|s p , x)} 
and |l^(y|6)} lie in the corresponding probability simplices for every (s p ,x) and every b, respectively, and 
therefore we actually proved a stronger convexity result than really needed. Note that this approach of proving 
convexity worked well because the surrogate function is well-behaved also outside these probability simplices. 
This is in contrast to [10] where information rates where optimized as a function of source probabilities: for 
N — > oo, the information rate was not well defined outside the corresponding simplices and so it was important 
to take directional derivatives within the simplices. For more details we refer the reader to [10]. 

Appendix III 
Proof of Lemma EH 

In this appendix, we minimize the surrogate function * (W, W) subject to the following constraints 

^W{s\s p ,x) = 1 (for all (s p ,x) eSxX), (99) 

s 

W(y\b) = 1 (for all b € B). (100) 

y 

(For the moment we omit the non-negativity constraints; we will see at the end that they are automatically 
satisfied.) Clearly, the Lagrangian is 

l 4 * w { w, w)-J2 Mi&>, *) ( E w(s\s P , *)-i)-X>0) ( E wtv\b) - 1) , (ioi) 

where |/ii(s p ,x)} and {^2(6)} are Lagrange multipliers. Using the expression in (|47T ) for ^ (W, W) and 
setting the derivative of L w.r.t. W(s\s p ,x) equal to zero yields 

,t L = : l f[ N \b) - Ml (l p , x) = 0, (102) 
oW(s\s p ,x) W(s\s p ,x) 

which results in 

W*(s\s p ,x)= Tl } b) (103) 

Therefore, the Lagrange multiplier fii(s p , x) is just a normalization constant so that ^2$ ^*(s|s p , x) = 1. Since 
Ti(b) is non-negative, cf. (1401 . non-negativity of the pmf W*(s\s p , x) is automatically satisfied. The optimum 
setting of W*(y\b) is similarly found. 

Appendix IV 
Proof of Lemma [24] 

For ease of reference, let us repeat here the relevant surrogate function: 

0fW(W,W) 4 AW(f) + -L^Q( x )^W(y|x)L> 6 (p(b|x,y) ||p(b|x,y)) . (104) 



A. Property 1 

This follows from standard KL divergence properties. 
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B. Property 2 

Using the definition (l36l ) of difference function, the surrogate function in (11041 ) is written as 

*W(W,W) = -^Q(x)W(y|x)log(W(y|x)) 



2N 



- E Q( x )^(y l x ) £ p ( fi l x > y) lo s ( # ^l x 

x,y b 

+ 2^ £ ^( x )^(y i x ) £ ^ix, y) lQ g (^( fi i x ' y 



^7 Yl Q(*)W{y\x) Y, HUx, y) log (P(b|x, y)) . (105) 



2N 



The first and third terms on the right-hand side are only functions of Q, W, and W and are independent of W. 
Therefore, by combining them as c^\w), we can write 

(W, W) 4 C W (^) _ * £ Q(x)W(y|x) £ P(b|x, y) log (P(b, y|x)) . 



2iV 

x,y £ 

Now using the AF-FSMC's decomposition property ( [TBI for iy(b,y|x), we can simplify the surrogate function 
further to 

^\W,W) =cf ] (W) 

- 2^ E Q( x ) w (y i x ) E p ( 6 i x ' y) E lo s (^( ( 106 ) 

x,y £ ^gXjv 

" 2^ E ^(y l x ) E ^|x, y) £ log (W(yt\k)) . 
x,y b <eZw 

After a few manipulations, we obtain 

^\W,W) = cf\w) -J>g (W(s\s p ,x)) f( N \b) - ^£ lo S (W(y|6)) ff(6 )V ). (107) 

S b v 

where if (6) and T 4 (A °(&,y) were defined in (l43l and (l44l . respectively. 



C. Property 3 

Similarly to the proof of Property 3 of Lemma l20l the convexity of ^^\\V ,W) w.r.t. W is established by 
looking at the corresponding Hessian matrix and by verifying that it is positive semi-definite. We leave the details 
to the reader. 



Appendix V 
Proof of Lemma 

For ease of reference, let us repeat here the relevant surrogate function: 
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=L{ N) (v) - l^Q(x)^(y|x)fl 



2N 



*K§,x,y) 

Es^( § '. x .y) 



^(s,x,y) 

Es^( §/ > x 5 y) 



(108) 



A. Property 1 
This follows from standard KL divergence properties. 

B. Property 2 

Inserting the definition fj0]> of l[ N) (W) into (fl08i we obtain 

x,y s ^ 

where (5) is independent of v. Applying the product decomposition (T22l of u(s,x, y), and the relation- 
ship (125T ). /.<?., 

5(s,x,y) 



* f } («) = c[ N) (s) + ^E o w^w E v g( -£' y) s log ^ (§ ' x ' y) ) ' (109) 



£ s , i( i<,*,y) 

we get, after some simplifying steps, 

*™ (v,v) = c[ N) (v)+J2J2 ! °g y-) ) f < 



(110) 
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6 y- 



C. Property 3 

Similarly to the proof of Property 3 of Lemma l20l the concavity of (^j ^) w - r -t- v is established by looking 
at the corresponding Hessian matrix and by verifying that it is negative semi-definite. We leave the details to the 
reader. 



Appendix VI 
Proof of Lemma [ 

For ease of reference, let us repeat here the relevant part of the information rate lower bound and the 
corresponding surrogate function: 



,(JV)/« A 



^E(W(y) lo g(E^ x/ 'y))' 

y \§',x' J 
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A. Property 1 

This statement follows trivially from the definition of c 2 (v). 



B. Property 2 
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On one hand, taking derivatives of ^3 (£>,?)) w.r.t. v(b,y—) we obtain 



dv(b,y°) 



On the other hand, X2 (v) can be expanded to the following expression 
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where Cg (v) is independent of -0. At first sight, expression (11 15b looks more complicated than expression (|1 12b . 
however, the relevant gradient of expression (II 151) is easier to find. Namely, the gradient of the first term with 
respect to v is the zero vector and so we do not have to worry about it. Similarly, the third term is a KL divergence 
and so the gradient with respect to v at v = v is also the zero vector. Therefore, we only need to have a close 
look at the second term. Applying the product decomposition ((221) of v(s,x, y), and using ([2TT> . i.e., 

#(s,x,y) 



£ g , x ,-C(s',x',y) 



K(S,xy) 



we get 







dv(b,yD) 







dv(b, y^- 
1 



v(b,yH) 

which indeed agrees with the expression in (II 14b . 
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(116) 
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C. Property 3 

Similarly to the proof of Property 3 of Lemma l20l the concavity of ^ (^j ^) w - r -t- v is established by looking 
at the corresponding Hessian matrix and by verifying that it is negative semi-definite. We leave the details to the 
reader. 
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